through August 1969 (Appendix C).
Bureau of Standards, is engaged in a continuing program to collect information and maintain
current awareness about research and development activities in the field of information
AUTOMATIC INDEXING


A  State-of-the-Art  Report 
Mary  Elizabeth  Stevens 

A state-of-the-art survey of automatic indexing systems
and experiments has been conducted by the Research Information
Center and Advisory Service on Information Processing,
Information Technology Division, Institute for Applied Technology,
National Bureau of Standards. Consideration is first
given to indexes compiled by or with the aid of machines,
including citation indexes. Automatic derivative indexing is
exemplified by key-word-in-context (KWIC) and other
word-in-context techniques. Advantages, disadvantages, and
possibilities for modification and improvement are discussed.
Experiments in automatic assignment indexing are summarized.
Related research efforts in such areas as automatic classification
and categorization, computer use of thesauri, statistical
association techniques, and linguistic data processing are
described. A major question is that of evaluation, particularly
in view of evidence of human inter-indexer inconsistency. It
is concluded that indexes based on words extracted from text
are practical for many purposes today, and that automatic
assignment indexing and classification experiments show
promise for future progress.

1.  INTRODUCTION 

This report of the Research Information Center and Advisory Service on Information
Processing (RICASIP) 1/ is one of a series intended as contributions to improved
cooperation in the fields of information selection systems development, information
retrieval research and mechanized translation. In each of these areas, automatic
techniques for linguistic data processing are receiving increased attention. This report
covers a state-of-the-art survey of current progress in linguistic data processing as
related to the possibilities of automatic mechanized indexing. Insofar as has been
practical, the survey of the literature on which this report is based has been made
through February 1964.

It has concentrated on the major developments in and related demonstrations of
automatic indexing potentialities. Examples are also given of indexes compiled by machine
and of potentially related research efforts in such areas as natural language text
searching, statistical association techniques used for search and retrieval, and proposed
systems for concept processing. There are, undoubtedly, various omissions. Neither
the inclusion of reports on various specific experiments and techniques nor the omission
of others is intended to reflect an endorsement as such of those that are included or an
adverse evaluation of those that are not mentioned.


1/

Initiated  at  the  instigation  of  the  National  Science  Foundation.  RICASIP  is  jointly 
1.1 Definitions and Background

The noun "index" has as its most general meaning "something used or serving to
point out, a sign, token, or indication" (American College Dictionary) or "that which
shows, indicates, manifests, or discloses; a token or indication" (Webster's International
Dictionary, 2nd Edition, unabridged). More specifically, an index is "a pointer or key
which directs the searcher to recorded information". 1/ The terms "index" and "indexing"
have been used in the fields of library science and documentation with reference to the fact
that the selection of information pertinent to a particular problem or interest, from all the
previously recorded information available, involves problems of decision-making based
on less than the full content or text of each of the records being searched.

Short  of  complete  scanning  of  all  the  possibly  relevant  material,  it  is  necessary  to 
select or "distill" condensed representations or surrogates 2/ for each item. These
surrogates  are  intended  to  direct  the  searcher  to  the  most  probably  pertinent  items  in  a 
collection.    The  operations  known  as  "indexing"  thus  involve: 

(1)  Choosing  clues  that  will  serve  to  identify,  for  purposes  of  later  retrieval,  a 
particular  book,  document,  or  other  recorded  item,  and 

(2)  Either  marking  on  the  item  itself  or  recording  as  a  separate  item- surrogate 
the  tags,  labels,  or  codes  representing  these  clues. 

The  second  of  these  two  steps  can  be  purely  clerical  in  nature,  but  the  first  has  been, 
to  date,  primarily  the  result  of  human  intellectual  efforts  in  subject  content  analysis. 

Well-known  inadequacies  of  human  indexing  operations  include  both  those  stemming 
from  man  himself  and  those  which  result  from  the  volume  and  the  character  of  the 
materials  with  which  he  deals.    On  the  human  side,  there  are  fundamental  questions  of 
perception, comprehension and judgment, as well as those of inter-indexer and even
intra-indexer consistency. In addition, the indexer is asked to guess in advance what others
will ask for, understand, and find relevant on future search. He is even asked, in effect,
to anticipate the language of future inquiries. Thus, a somewhat facetious definition of the
noun "index" has a considerable sting of truth: "A system of analyzing information in
which the method used to choose categories is carefully hidden from the user. An attempt
to outguess the future." 3/

The nature of the material to be indexed, especially in the area of scientific
information, raises a number of crucial problems. The still increasing spate of production of
technical  literature  and  reports  poses  not  only  the  problems  of  sheer  volume  in  terms  of 


1/ Crane and Bernier, 1958 [144], p. 513.

(Note: Full citations of references are given in the bibliography by author and by
numerical order of the figures in brackets.)

2/ See, for example, R. E. Wyllys, 1962 [651], for discussion of the two-fold purposes
of condensed representations: to serve a search-tool function on the one hand and
a content-revealing one on the other.

3/ Vanby, 1963 [622], p. 143.
manpower requirements and time necessary to produce indexes, but also problems of glut
in  terms  of  man-hours  necessary  for  the  individual  scientist  to  maintain  awareness  of 
what  is  going  on  in  his  field.     There  are  major  problems  created  by  newly  emerging  fields 
of  effort,  new  interdisciplinary  areas  of  interest,   and  dynamically  evolving  terminology. 
Increasing  specialization,   on  the  other  hand,  brings  out  additional  difficulties  in  finding 
what  has  been  done  elsewhere  that  might  be  applicable  to  one's  own  work  and  in  avoiding 
wasteful  duplication  of  effort,  with  their  own  attendant  problems  of  terminology. 

All  these  problems  are  aggravated  by  the  increasingly  critical  urgency  which  should 
apply  to  making  all  useful  information  available  to  those  who  need  it  as  promptly  and  as 
selectively  as  possible.     Recognition  of  this  urgency  and  of  the  inadequacies  of  present 
solutions  has  therefore  prompted  consideration  of  the  feasibility  of  using  machines  to 
assist  in  the  indexing  process. 

The  term  "mechanized  indexing"  signifies  the  accomplishment  of  some  or  all  of  the 
indexing  operations  by  mechanized  means.     The  term  includes  the  use  of  machines  to 
prepare  and  compile  indexes,  and  to  sort,  assemble,  duplicate  and  interfile  catalog  cards 
carrying  index  entries.    In  this  report,  however,  we  shall  be  concerned  primarily  with 
the  area  of  automatic  indexing,  that  is,  the  use  of  machines  to  extract  or  assign  index 
terms without human intervention once programs or procedural rules have been
established. This term is chosen in preference to auto-indexing as originally suggested by
Luhn (1961 [373]) for the reasons set forth by Bar-Hillel, 1/ and to machine indexing 2/
due to possible confusion with machine tool operations. Automatic indexing has been used
by such workers in the field as Gardin (1963 [209]), Kennedy (1962 [310]), Maron (1961
[395]), Swanson (1962 [584]), and Wyllys (1963 [653]).

For  obvious  reasons,  we  also  subsume  under  this  term  any  specifically  "clerical" 
(Fairthorne, 1956 [188], 1956 [189], 1961 [190]) and hence machinable operations that
can similarly be substituted for human intellectual effort. There is nothing that machines
can do which people cannot do except for limitations of time, cost, or availability of
appropriate resources. Thus, we shall consider "machine-like indexing by people"
(O'Connor, 1961 [447]; Montgomery and Swanson, 1962 [421]) as falling properly within
the scope of automatic indexing, especially in the sense of ". . . deciding in a mechanical
way to which category (subject or field of knowledge) a given document belongs . . .
deciding automatically what a given document is 'about'." 3/

The principle of indexing, that is, of using subject-content clues and item surrogates
as substitutes for searches based on perusal of the full contents, has a history of several
millennia. In ancient Sumer and Babylon, clay tablets were sometimes covered with a
thin clay envelope or sheath that was inscribed with brief descriptions of the contents of
the tablet itself (Carlson, 1963 [101]; Hessel, 1955 [268]; Lalley, 1962 [343]; Olney,
1963 [458]; Schullian, 1960 [525]). The first known instance of an index list is
apparently that of Callimachus in the third century B.C., which was a guide to the
contents of some 130,000 papyrus rolls (Olney, 1963 [458]; Parsons, 1952 [469]).


1/ Bar-Hillel, 1962 [35], p. 417.
2/ Bohnert, 1962 [69]; Edmundson, 1959 [176]; and others.
3/ Maron, 1961 [395], p. 404.
Application of the indexing principle by use of clerical procedures that today can be
accomplished by machine was suggested a little more than a century ago. A British
librarian, Andreas Crestadoro, advocated the permutation of the words in titles in 1856,
claiming that thus the subject matter index would follow the author's own definition of the
contents of his book. He prepared such "concordances of titles" for several different
library collections. 1/

Within a generation, punched card machines had been invented, but they were not to
be used for library and documentation purposes for some decades yet. 2/ Keppel,
writing in 1937 of his vision of the library 21 years in the future, says:

"When it comes to using the cards, I blush to think for how many years we watched
the so-called business machines juggle with payrolls and bank books before it
occurred to us that they might be adapted to dealing with library cards with equal
dexterity. Indexing has become an entirely new art. The modern index is no
longer bound up in the volume, but remains on cards, and the modern version of
the Hollerith machine will sort out and photograph anything the dial tells it . . ." 3/

By 1945, Bush had prophesied Memex [93], and in the 1950 Windsor lectures
Ridenour referred to an RCA development, the so-called "electronic pencil", a proposed
reading aid for the blind intended to convert printed characters to a suitable coded form.
He went on to suggest:

". . . We shall have to arrange for cataloguing to be done by machine, without
human interaction except in terms of setting up once for all the system on
which the cataloguing is performed . . . It is only a step from this device (the
electronic pencil) to the electronic catalogue, which will read text for itself,
recognize key symbols and phrases with which it has been provided, and
construct appropriate catalog entries for the text it reads." 4/

It has only been in the past decade or so, however, that there have been any serious
efforts directed to the use of machines for automatic indexing. In the period 1957-1958,
Luhn first presented and published several provocative papers dealing with such
challenging possibilities as "auto-abstracting", "auto-encoding" and "auto-indexing"
(Luhn, 1957 [385]; 1958 [374]; 1959 [371]). Luhn's work on the permutation of
significant words in titles, abstracts, and complete text, the Keyword-in-Context or KWIC


1/ See Crestadoro, 1856 [146]; see also Farley, 1963 [192]; Linder, 1960 [362];
Metcalfe, 1957 [416]; and Ohlman, 1960 [451].

2/ See pp. 19-22 of this report.

3/ See Keppel, 1939 [316], p. 5.

4/ See Ridenour, 1951 [500], p. 26.
system, also began about this time. 1/ Also in 1958, Baxendale published the results of
experiments in automatic indexing involving scanning of topic sentences, syntactical
deletion processes and automatic phrase selection (Baxendale, 1958 [41]).
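
The keyword-in-context arrangement can be illustrated by a short sketch. What follows is a minimal Python example, not a reconstruction of any program named in this report: the function name `kwic_lines`, the stop list, and the choice to truncate (rather than wrap) the context are all assumptions made for illustration. The 60-character line width simply echoes the 60-character card field mentioned in Luhn's chronology.

```python
def kwic_lines(title, stop_words, width=60):
    """Return (keyword, line) pairs with each keyword starting at mid-line.

    Hypothetical helper: context is truncated at the line edges, and
    'stop_words' is an illustrative list of words not to be indexed.
    """
    half = width // 2
    words = title.split()
    text = " ".join(words)
    lines = []
    start = 0  # character offset of the current word within 'text'
    for word in words:
        key = word.strip(".,;:").lower()
        if key and key not in stop_words:
            left = text[:start][-half:]                  # context before the keyword
            right = text[start:start + (width - half)]   # keyword plus following context
            lines.append((key, left.rjust(half) + right.ljust(width - half)))
        start += len(word) + 1
    lines.sort(key=lambda pair: pair[0])  # alphabetical order of keywords
    return lines


if __name__ == "__main__":
    sample = "Impulse Voltage Circuit For Use With Recurrent Surge Oscillators."
    for _, line in kwic_lines(sample, {"for", "with"}):
        print(line)
```

Every line has the same width and every keyword begins in the same column, so the sorted listing can be scanned down the middle; that is the essential property of a KWIC display.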


With respect to the KWIC and permuted title techniques, several independent
approaches were being developed at about the same time as Luhn's. These concurrent
efforts were carried out at the Wright Air Development Center (Netherwood, 1958 [437]),
the Rocketdyne Division of North American Aviation (Carlsen, et al, 1958 [99]), and the
System Development Corporation (Citron, et al, 1958 [120]; Ohlman, 1960 [451]). 2/
Netherwood's permuted title index to a bibliography on logical machine design involves
manual simulation of a machineable method. Although the results were not published
until June 1958, the manuscript was submitted in November 1957. 3/ The Rocketdyne
permuted-title bibliography, on industrial control, is credited by both Henderson (1962
[263]) and Ohlman (1960 [451]) as the first to be produced on computers, the program

1/ In a private communication dated March 13, 1963, Luhn provided the following
chronology:

May 1957        Routine 1 Program for word isolation within 60 characters per card,
                written by H. C. Fallon.

1957-1958       Creation of concordances of various scientific papers in the form of
                cards, each card showing a keyword centrally located within 60 letters
                worth of the associated phrase. Experimentation with these cards to
                arrive at thesauri for special fields of interest or study. Idea of
                automatic indexing by means of significant or keywords in context conceived
                by H. P. Luhn.

May 1958        Keyword-in-Context Index for titles only initiated by H. P. Luhn and
                samples produced with Routine 1 Program.

June 1958       Start punching of titles for Keyword-in-Context Index for literature on
                Information Retrieval and Machine Translation. (Keypunching done by
                Miss Olive Ferguson.)

August 1958     Simplified version of Routine 1 written by H. C. Fallon for generating
                Keywords-in-Context Indexes and delivered to Service Bureau
                Corporation, New York City.

September 1958  First Edition of Bibliography and Keyword-in-Context Index on
                Information Retrieval and Machine Translation published by Service
                Bureau Corporation.

January 1959    Started writing program for improved version of Keyword-in-Context
                Index, including derived identification code, written by Jr. J. Havender.

June 1959       Second Edition of Bibliography and Keyword-in-Context Index on
                Information Retrieval and Machine Translation, published by Service
                Bureau Corporation, including derived identification codes.

2/ See also National Science Foundation's CR&D Report No. 3 [430], p. 39.
3/ Netherwood, 1958 [437], p. 155, footnote.
having been written by J. T. Madigan. 1/ At any rate, both this program and Luhn's
KWIC program at IBM were apparently written relatively early in 1958.

Citron et al (1958 [120]) in presenting results of the SDC work and Ohlman in his
chronological bibliography of permutation indexing (1960 [451]) cite as at least partial
predecessors the "rotated file" principles developed at the Chemical-Biological
Coordination Center (1954 [112]; Heumann and Dale, 1957 [270] and 1957 [271]; Wood, 1956
[649]). It should also be noted as a matter of historical background that a system for
machine manipulation and compilation of permuted title-and-term-index records has been
in productive operation since 1952. 2/ This earlier effort was not generally known to
other investigators and was apparently first reported in the open literature as late as 1961.

Notwithstanding such other efforts, it is conceded by almost all workers in the fields
of automatic abstracting and indexing that the major credit for pioneering interest and
impetus should be attributed to Luhn and Baxendale. Specific acknowledgements of their
"pioneering work" and "first steps" have been made by many investigators both in this
country and abroad: for example, Borko and Bernick, 3/ Hines, 4/ Mooers, 5/ Pevzner
and Styazhkin, 6/ and Wyllys. 7/ In particular, the Russian investigator Purto states:
"So far as we know H. P. Luhn was the first investigator to suggest the concept of a set
of significant words for the consideration of problems in automatic abstracting." 8/

Much of the early effort in 1957-58, whether at IBM or elsewhere, was in fact spurred
on by the International Conference on Scientific Information (ICSI) held in Washington, D. C.,
in November 1958. The printed text of both the Preprints [478] and the final
Proceedings [480, 481] was deliberately prepared, over the typographer's objections,
so that a double space followed each period ending a sentence, in order to facilitate
machine processing of this text. Thus the printers ". . . . were faced with . . . the
necessity  to  prepare  the  final  volume  of  the  Proceedings  from  these  preprints,  and  to 
arrange  type  composition  amenable  to  computer  analysis.    The  latter  is  an  experiment. 
With  an  eye  to  the  distant  future,  the  Program  Committee  wished  to  make  available  the 
monotype  punched  tapes  from  the  text  for  statistical  studies  with  computers.    We  hope 


1/ Carlsen, et al, "Information Control", 1958 [99], p. 20.

2/ Veilleux, 1962 [624], p. 81: "Consumer demand balanced against availability of
manpower and machine time were the factors which led to the establishment of the
permutation title word indexing project in 1952."

3/ Borko and Bernick, 1962 [77], p. 3.
4/ Hines, 1963 [273], p. 7.
5/ Mooers, 1963 [424], p. 4.
6/ Pevzner and Styazhkin, 1961 [472], p. 3.
7/ Wyllys, 1961 [650], pp. 6-7.
8/ Purto, 1962 [484], p. 2.
some work of this kind will be demonstrated during the Conference. This has caused some
compromises in typography . . ." 1/

Several  pioneering  experiments  in  automatic  indexing  were  applied  to  this  ICSI 
material.    One  of  these  led  to  the  preparation  of  a  permuted  keyword  index  based  on 
titles,  subtitles,   section  and  table  headings,  figure  captions,   and  selected  sentences  or 
phrases taken directly from the text (Citron, et al, 1958 [120]). It was prepared using
punched card equipment, and the resulting listings were distributed to the Conference
participants in November of 1958. Another set of experiments involved trial of the
"auto-abstracting" and "auto-encoding" techniques proposed by Luhn (1958 [379]). 2/ A
computer program potentially applicable to certain ancillary operations which might be
involved in automatic indexing was also demonstrated at the time of the ICSI sessions
(Stevens, 1959 [568]).

Much  of  the  rapidly  proliferating  work  in  the  field  of  automatic  indexing  since  that 
time  has  been  inspired  directly  or  indirectly  by  the  results  of  these  experiments  using 
the  ICSI  material.    For  example,  Dowell  and  Marshall,  discussing  early  efforts  at  the 
English  Electric  Company,   state:    "We  first  became  interested  in  the  possibilities  of 
computer  produced  indexes  through  Luhn's  work  at  IBM  and  the  early  examples  of  KWIC 
indexes which were distributed at the time of the Washington Conference . . ." (Dowell
and Marshall, 1962 [159]).


1/ "Preprints of papers of the International Conference on Scientific Information,"
1958 [478], Preface. (The monotype tapes are in fact still held in the custody of
the Research Information Center and Advisory Service on Information Processing,
National Bureau of Standards, but difficulties to be discussed later in this report
discourage their use.)

2/ See also his "Automated intelligence systems", 1962 [372], note 11, p. 100:
"Papers for this conference were distributed to participants two months ahead for
study. By arrangement with the Columbia University Press the Monotype tapes used
in publishing these preprints were made available for experimentation. At the
conference exhibit, IBM researchers demonstrated the automatic transcription of
these Monotype tapes to magnetic tape via punched cards and thence the automatic
creation and printout of abstracts by means of electronic data processing equipment
at the Space Systems Center in Washington, D. C. All this was done without any
human intervention except for the handling of the input and output records. Also,
preprinted Auto-Abstracts of Papers of Area 5 of the Conference were made
available to participants at the beginning of the conference."

3/ 

See  also  R.  A.  Kennedy,   1962  [310],  p.   181:    "While  automatic  indexing  in  any 
interpretative  and  analytical  sense  is  therefore  not  yet  a  practical  matter,  a 
simpler  mode  of  machine  indexing  is  coming  into  wide  use  .  .  .  primarily 
stimulated  by  the  publication  in  1958  and  1959  of  reports  by  Ohlman,  Hart  and 
Citron  and  Luhn.  " 
A somewhat premature attempt was made to establish a subscription service for
KWIC indexes for a number of journals, for initial distribution beginning January 1,
1959. 1/ Called PILOT (Permutation Indexed Literature Of Technology), the proposed
service was advertised as "a revolutionary new totally cross-referenced index . . . and it
will be produced at the speed of light". Figure 1 is a reproduction of a part of the
brochure issued in 1958 by Permutation Indexing, Incorporated, Sol Grossman, President,
Los Angeles. While, perhaps unfortunately, the number of subscription orders received
was not adequate in terms of the ambitious coverage planned, work on permuted title
indexing elsewhere did lead rapidly to the publication of such indexes on a production
basis.
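
The title permutation demonstrated in the PILOT brochure can be sketched in a few lines. This is a minimal Python illustration under stated assumptions, not the procedure used by Permutation Indexing, Incorporated: the function name `permuted_entries` and the stop list are hypothetical, and the only data taken from the brochure are the sample title and its periodical reference.

```python
# Hypothetical stop list: words that do not earn an index entry of their own.
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def permuted_entries(title, source, stop_words=STOP_WORDS):
    """Return alphabetized (rotated title, source) pairs, one per indexed word.

    Each entry starts at an indexed word; the words that preceded it in the
    title are carried to the end, after the period that closes the title.
    """
    words = title.rstrip(".").split()
    entries = []
    for i, word in enumerate(words):
        if word.lower() in stop_words:
            continue
        rotated = " ".join(words[i:]) + ". " + " ".join(words[:i])
        entries.append((rotated.strip(), source))
    entries.sort(key=lambda entry: entry[0].lower())
    return entries


if __name__ == "__main__":
    title = "Impulse Voltage Circuit For Use With Recurrent Surge Oscillators."
    for rotated, source in permuted_entries(title, "IBM Journal Vol. 6 No. 3"):
        print(f"{rotated:<70}{source}")
```

With this stop list the sample title yields seven entries, including the three forms shown in the brochure (filed under "Circuit", "Impulse", and "Oscillators"). Which words a real service would suppress is a design decision the brochure excerpt does not settle.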

As  of  February  1964,  there  are  more  than  40  examples  of  KWIC  and  other  variations 
of  permuted  keyword  indexing  techniques  in  productive  operation  or  available  to  the 
searcher. KWIC-type techniques have also been extended to special one-time index
compilations and other applications, as in "automated content analysis" of verbal protocols of
psychiatric interviews and group leadership training sessions (Ford, 1963 [198]; Hart and
Bach, 1959 [256]; Jaffe, 1962 [294] and 1958 [296]; Stone, et al, 1962 [575]).

The same period during which the ICSI was planned and held (1957-1958) was also
marked by the first issue of Current Research and Development in Scientific
Documentation by the National Science Foundation. In it and in subsequent issues, there were
reported other early efforts in machine-compiled indexes, in the construction and use of
special thesauri, and in indexing and retrieval experiments based on machine processing
of text. Thus, for example, punched card methods for compiling printed indexes and
announcement lists were under consideration at Bell Laboratories and at Esso Research
and Engineering. Special attention was being given to thesauri as early as July 1957 at
both Chemical Abstracts Service and the Cambridge Language Research Unit, and
at Ramo Wooldridge, "Research on the problems of fully automatic indexing and retrieval
based on raw text input to a general-purpose computer is under way." 2/

Nevertheless,  as  of  the  present  date,  the  question  of  the  possibility  of  automatic 
indexing  in  the  sense  of  the  substitution  of  machineable  procedures  for  human  intellectual 
efforts  normally  required  to  identify,  categorize,  classify,  index,   select,  and  list 
particular  items  in  a  collection  of  items  is  still  moot.    Opinions  run  the  gamut  from 
extreme pessimism, "Mechanization of abstracting and indexing is rejected as
impractical for the foreseeable future", 3/ to enthusiastic optimism, "The conclusion that automatic
indexing and cataloging is superior to human indexing and cataloging is both provocative
and remarkable." 4/

Borko and Bernick claim that ". . . Raw data, i.e., unedited natural language text,
can be processed statistically so as to automatically assign index terms to each document
and to classify the document into a subject category; this has been demonstrated." 5/ On
the  other  hand,  Farradane  thinks  that  any  form  of  mechanized  processing  in  indexing 


1/ See Linder, 1960 [363], p. 99 and Figure 1.

2/ National Science Foundation's CR&D Reports No. 1 [430], pp. 4, 6; No. 3 [430],
pp. 12, 19, 31.

3/ Bar-Hillel, 1958 [33], abstract.

4/ Swanson, 1962 [584], p. 468.

5/ Borko and Bernick, 1963 [78], p. 28.
THE PRINCIPLE OF PILOT'S PERMUTATION INDEXING CAN BE DEMONSTRATED BY ONE SAMPLE TITLE:

TITLE OF ARTICLE                                                      PERIODICAL

Impulse Type . . . etc.
Impulse Voltage Circuit For Use With Recurrent Surge Oscillators.     IBM Journal Vol. 6 No. 3
Input . . . etc.

On another page of the index the entry would appear as follows:

Centigram Measu . . . etc.
Circuit For Use With Recurrent Surge Oscillators. Impulse Voltage     IBM Journal Vol. 6 No. 3
Circuit . . . etc.

And on still another page of PILOT:

Oscillators. Impulse Voltage Circuit for Use With Recurrent Surge     IBM Journal Vol. 6 No. 3
Oscillograph Magaz . . . etc.
Oscillograph . . . etc.

Similarly, the ot[...] will appear i[...] of the same title, as well as the Journal title,
and sorted in the left hand (index) position.

[...]otes will be keyed to each PILOT issue

Figure 1. Brochure for Proposed Permuted Title Index Service
operations is "liable to continuous error", 1/ while Baxendale takes a middle ground:
"Thus far the role of the computer is chiefly that of research instrument; whether or not
it can fully assume the task of indexing is still in doubt". 2/

1.2 Scope of This Study

In  view  of  the  continuing  controversy  over  the  feasibility  and  evaluation  of  automatic 
indexing  techniques,  a  state-of-the-art  survey  and  report  is  perhaps  premature  at  this 
time.    The  topic  is  controversial  on  at  least  five  grounds:    First  is  the  question,  "Can 
indexing be done by machine at all?" Next, "Is what can be done by machine properly
termed  'abstracting',   'indexing',  or  'classifying'?"    The  third  moot  point  is  "Is  whatever 
can  be  done  by  machine  good  enough,  acceptable,  as  good  as,  or  better  than  the  product 
of  human  operations?"    The  fourth  and  most  critical  question  is  "How  can  we  evaluate 
acceptability  or  comparability  for  any  indexing  process  whatsoever,  whether  carried  out 
by  man  or  by  machine  or  by  machine-aided  manual  operations?"    Finally,  "If  an  indexing 
product  is  to  be  achieved  by  machine,  can  it  be  done  by  statistical  means  alone,  or  must 
syntactic,   semantic  and  pragmatic  considerations  be  brought  to  bear  in  the  machine 
decision-making  processes?" 

The heat of controversy over any of these five grounds of debate is almost inversely
related to the availability of objectively validated evidence to which appeal might be made.
Thus, the literature on the topic to date is typically colored by personal reactions both
pro and con, and even the cynics rely more on subjective judgments and personal
preferences than on any substantial body of data. O'Connor cites typical claims of both
proponents and opponents of the feasibility of automatic indexing, and he comments on both,
"I have seen no good evidence offered in support of such a conclusion." 3/

An impartial middle ground is offered by recognition that "To define a process
ordinarily thought to require human intellectual effort in such a way that it can be
performed by a machine imposes a rigor and a discipline on the definition which itself is
invaluable to understanding the nature of the process". 4/ Learning more about the indexing
process itself, through experimentation with machines, will provide "results of general
interest, not just to those optimistic about machine indexing experiments". 5/ In this
sense, a state-of-the-art study is not premature, and we shall accordingly
explore the five questions listed above in subsequent sections of this report.


1/ Farradane, 1961 [193], p. 236.

2/ Baxendale, 1962 [42], p. 69.

3/ O'Connor, 1961 [447], pp. 274 and 275.

4/ Swanson, 1962 [583], p. 288.

5/ Bohnert, 1962 [69], p. 9.
More particularly, in this survey of automatic indexing efforts, we will be concerned
with  the  following  principal  topics: 

(1)  A  brief  indication  of  the  variety  of  ways  in  which  punched  card  machines  and 
computers  can  be  and  have  been  used  in  the  preparation  or  compilation  of 
indexes. 1/

(2)  A  more  detailed  consideration  of  the  possibilities  for  machine  generation  of 
indexes,   specifically  including: 

(a)  Automatic derivative indexing, as in various examples of machine
extraction of keywords, where selection is based upon pre-specified
criteria,

(b)  Automatic  assignment  indexing,  whereby  the  machine  is  programmed  to 
determine,  in  accordance  with  various  specified  criteria,  whether  or 
not  some  one  or  more  members  of  an  established  list  of  'labels'  (such 
as  subject  headings,  class  names,  descriptors,  or  other  indexing  terms) 
should appropriately be assigned to the document or item in question, and

(c)  Automatic  classification  techniques,  on  which  such  assignment-indexing 
operations  may  or  may  not  be  based. 

(3)  Consideration of the use of machines as relatively sophisticated aids to human
intellectual operations applied in either subject-content analyses or
search-strategy determinations.

(4)  Discussion  of  the  question  of  evaluation  of  any  index  whatever,  whether 
manually  or  mechanically  prepared. 

(5)  Consideration  of  the  implications  of  related  research  and  development  efforts, 
specifically  including: 

(a)  Comparative  evaluation  of  indexing  systems, 

(b)  Development and use of new types of "indexing" aids (in the sense of
"pointing to" and "indicative of" the probable subject-content relevance)
to either selective dissemination or retrospective search of the technical
literature,

(c)  Linguistic and logical-inference approaches to the elucidation of 'meaning'
in natural-language messages, and

(d)  Theoretical approaches to the problems of determining
"membership-in-classes".


1/ Note that card-controlled camera systems, such as the Listomatic, and Addressograph
machines have also been used for index compilations. See, for example, Shaw,
1951 [542], p. 49, who cites early use of the Addressograph for bibliographical work
by A. Predeek, "Die Adrema-Maschine als Organizationsmittel im
Bibliotheksbetriebe", Berlin, 1930, and E. Morel, "Les Machines au secours de la
Bibliographie", Revue du Livre 1:14-19 (1933). Use of such devices is not included in this
report, however, since they cannot be adapted to machine generation of indexes.
(6)  Appraisal of the current prospects for further research and development.

Certain difficulties of organization are evident. Thus many proposals precede actual
tests of techniques to which they are akin. Other proposals have been engendered as
byproducts of or incidental to investigations of other techniques, such as those of text
processing to derive by machine selected sentences which together may serve as
automatically generated "abstracts", more properly extracts. 1/

This related subject of automatic abstracting, i.e., the application of
machine-usable rules to the extraction or generation of textual information representing in
condensed form that carried in the document as a whole, will not be of primary concern.
However, it will be noted that most of the automatic abstracting techniques so far
proposed are potentially usable as tools for automatic indexing, especially in the trivial
sense that the automatic selection of index terms could be based solely upon the
substantive words found in the machine-prepared extract. 2/ Further, since we are presuming
that a state-of-the-art review of automatic indexing techniques is in some sense
appropriate at this time, we shall emphasize the actual results of machine compilation and
machine generation of indexes and those investigations of assignment-indexing techniques
for which experimental or comparative data have been reported, rather than theoretical
approaches.


1/ See, for example, Luhn, 1959 [384], p. 4: "The principle of abstracting
information by extracting certain portions or elements from the full text of a
document is particularly suitable to mechanization"; Becker, 1960 [44], p. 13:
"Perhaps 'extracting' would have been a better word than 'abstracting'"; Edmundson
and Wyllys, 1961 [181], p. 227: "All proposed methods for making an automatic
abstract of a document involve using the author's own words by selecting complete
sentences, thereby reducing abstraction to the simple task of extraction."

2/ See Wyllys, 1963 [653], p. 22: "Automatic indexing is an area that seems
to us to be especially close to automatic abstracting, since the words and word
groups found to be most representative of a document for automatic abstracting
purposes are obvious candidates for entries in an automatic index for the
documents." See also Tanimoto, 1961 [594], p. 235: "Thus after
extracting k sentences which are a predetermined small fraction of the document,
we have an 'abstract'. To find the indexes to the document we take these k
sentences and the corresponding sets of the canonical elements and consider
terms versus sentences instead of sentences versus terms... The same analysis
is then applied to this 'transposed' problem to produce the index terms"; Yakushin,
1963 [654], p. 17: "If some method can be employed for the automatic compilation of
abstracts, it can as well be used for the subject index."
1.3 Derivative vs. Assignment Indexing


At  least  part  of  the  provocation  and  controversy  with  respect  to  the  possibilities  for 
the  use  of  machines  in  indexing  is  due  to  confusion  as  to  what  type  of  indexing  is  meant. 
This in turn relates to a much older and broader controversy--that between "word" or
"catchword"  indexing  on  the  one  hand  and  "subject  indexing",   "concept  indexing",  or 
"controlled  indexing"  on  the  other. 

In terms of operational definition, the contrast is best expressed in Luhn's
distinction between index entries that are derived from the text of an item itself and those
that are assigned to it from a list or schedule of subject categories, descriptors and the
like, which exists independently of the text of the item (Luhn, 1962 [372]). 1/ In general,
the differentiations that are made for the broader controversy, and the claims and
counter-claims made by the enthusiasts of either school, provide background for the
distinctions that should be made between various automatic derivative indexing operations
and whatever possibilities may be demonstrated for assignment indexing by machine.

In his text on information storage and retrieval, Kent (1962 [315]) contrasts word
indexing as used in permuted key-word indexes, concordances and "pure" Uniterm systems with
controlled indexing which "implies a careful selection of terminology used in indexes in
order to avoid, as far as possible, the scattering of related subjects under different
headings." He notes elsewhere that word indexing requires little subject-matter training
on the part of the indexer and little skill in indexing as such, and adds: "It is this type of
indexing that a machine can perform well." 2/

Like Kent, Bernier thinks that true subject or assignment indexing requires highly
trained human indexers. He says further:

"The difference between subject and word indexing has been unclear at times.
Both types employ words, but only true subject indexing employs them with
discrimination. Word indexing leads to omission of entries, scattering of
related information, and a flood of unnecessary entries. Word indexing uses
words as they are found in the material indexed with a minimum regard for
standardized meaning..." 3/

Herner provides a further amplification of differences that are pertinent to
consideration of indexing by machine, as follows:


1/ See also Herner, 1962 [266], p. 5; Skaggs and Spangler, 1963 [557], p. 60; Slamecka,
1963 [558], p. 224. Mooers makes a similar distinction between "index terms which
are words or phrases extracted from the text and stylized conceptual terms--cliches
--which are assigned to the text", 1963 [423], p. 4.

2/ Kent, 1962 [314], p. 268.

3/ Bernier, 1956 [54], p. 23.

4/ Herner, 1963 [267], p. 183.
"The differentiation that is made between the two types of indexing is that word
indexing is inextricably tied to the words in a text: If a word appears it gets
indexed as such; if it does not appear it does not get indexed. Concept
indexing, on the other hand, has an element of abstraction in it: Words may either
be indexed as such or may be converted, either by themselves or in combination
with other words, into concepts which may not bear a direct resemblance to
the words or combinations of words that evoked them in the indexer's mind." 4/

Machine techniques such as those of Luhn's KWIC, like the early Uniterm systems,
look no farther than the words used by the one author himself. Techniques such as those
of Maron, Swanson, Borko, Meadow and Williams, among others, look specifically to
relationships between words as used by one author and patterns of word usage in a given
subject area or given document collection. They may also look to these patterns as in
turn related to prior human analytic judgments of the "aboutness" referents of items in
the collection. In this sense, they at least attempt replication by machine of assignment
indexing.
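
The general idea of relating an author's words to word-usage patterns in previously indexed items can be illustrated with a deliberately simple sketch. It is not the procedure of any of the investigators named above; the function names, the toy training items, and the additive scoring rule are all assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def train(labeled_docs):
    """Count, for each label, how many training items containing each word carry it."""
    counts = defaultdict(Counter)
    for text, labels in labeled_docs:
        words = set(text.lower().split())
        for label in labels:
            counts[label].update(words)
    return counts

def assign(text, counts, top_n=1):
    """Rank labels by the summed word/label co-occurrence counts for `text`."""
    words = set(text.lower().split())
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    ranked = sorted(scores, key=lambda label: (-scores[label], label))
    return ranked[:top_n]

# Hypothetical items that human indexers have already assigned to categories.
training = [
    ("magnetic tape storage for computer memory", ["computers"]),
    ("computer program for sorting punched cards", ["computers"]),
    ("syntax and grammar of english sentences", ["linguistics"]),
    ("grammar rules for sentence analysis", ["linguistics"]),
]
model = train(training)
print(assign("a grammar for analysis of sentences", model))
```

The label need not occur anywhere in the text of the new item; it is drawn from the pre-established list, which is what distinguishes assignment from derivation.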

There is no real question but that machines can in fact derive words from text
provided that it is in machine-readable form. This machine procedure may involve direct
extraction of all words as index entries, as in a complete concordance. It may involve
the extraction of only those words which survive a "purging" operation in which articles,
conjunctions, adjectives, and other "common" words are first deleted. Various
machine-controlled modifications to such "derivative" indexing are also available. The case for
machine  achievement  of  assignment  indexing  for  any  but  limited  special  cases  is  not  so 
clear. 
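
A minimal sketch of the two derivative procedures just described, complete extraction and extraction after a "purging" operation, is given below. The purge list, the function name, and the sample title are assumptions chosen for illustration; an operational system would use a much longer list of common words.

```python
import re

# Hypothetical purge list; real lists of "common" words are far longer.
COMMON_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "or", "the", "to", "with"}

def derive_index_terms(text, purge=COMMON_WORDS):
    """Return the distinct words of `text`, in order of first appearance,
    omitting any word found in the purge list."""
    terms = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in purge and word not in terms:
            terms.append(word)
    return terms

title = "The Use of Machines in the Preparation of Indexes"
print(derive_index_terms(title, purge=set()))  # every word, as in a concordance
print(derive_index_terms(title))               # only words surviving the purge
```

Every term produced in either mode is a word that actually occurs in the text, which is the defining property of derivative indexing.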

2.    INDEXES  COMPILED  BY  MACHINE 

A  first  and  obvious  use  of  machines  in  indexing  processes  is  in  the  manipulation  of 
index  entries,  previously  selected  on  the  basis  of  human  analysis,  to  produce  various 
orderings,  duplications  and  listings  of  these  entries.    The  power  of  machine  techniques 
to  speed  and  economize  the  sorting,  ordering  and  listing  operations  in  the  preparation 
or  compilation  of  indexes  was  recognized  quite  early,  both  in  the  field  of  library  science 
and in the consideration of potential areas of application by specialists in machine
potentialities.

In  particular,  two  specialized  types  of  index,  at  least  in  the  broad  sense,  are  such 
that their compilation would be almost prohibitive in terms of time and cost were it not
for the use of machines. 1/ These are, respectively, the case of the complete index, the
index  to  all  words  of  a  text  in  their  various  contexts,  which  is  a  concordance,  and  the 
case  of  the  "citation  index",  which  has  been  used  in  the  field  of  law  for  many  years  but 
has  only  quite  recently  been  suggested  for  literature  search  purposes  related  to 
scientific  and  technical  information. 


1/ See, for example, Doyle, 1963 [162], p. 11: "Without data-processing
machinery, concordances are prohibitively expensive to generate for most uses
except in those cases where it is well known that a given volume of text is going
to be used again and again, by large numbers of people over a long period of
time. As we know, clergymen have made use of manually prepared concordances
of the Bible since the 12th century".
In machine-compiled indexes, no items or entries are eliminated by the machine,
whereas in even the most rudimentary of machine-generated indexes, such as KWIC,
various reductive or extractive operations are automatically applied as a part of the
machine procedure. We shall be concerned in this section with brief discussions of
machine-compiled indexes and related devices, specifically, concordances, card or book
catalogs mechanically prepared, citation indexes, and special indexes such as Tabledex.
The use of machines to compile, sort, duplicate and list index entries can only be
considered to be mechanized indexing in a relatively trivial sense. We shall consider,
therefore, only a few representative examples, emphasizing early work and some of the
pioneering instances.

2.1 Concordances and Complete Text Processing

When, as early as 1856, Crestadoro proposed the use of permutations of the words in
titles as a subject-content index, the only "machines" available for the processing
operations were people acting in a strictly clerical way. Precisely such clerical operations
have been used for centuries in a process that is, in the special sense of full
representation of document contents, an index-producing operation--the making of concordances. 1/
The task of listing each separate word in a book in all the contexts in which it appears
is incredibly time-consuming and tedious when carried out by manual means. There are
those who have spent the major part of their lifetimes at this task. For example: "It
took James Strong thirty years to compile his exhaustive Concordance of the Bible..." 2/
The use of machines capable of processing signals which represent and preserve
information offered a potentially revolutionary change, and with the advent of the electronic
computer even more radical possibilities of very high speed processing were opened up.

As  early  as  1949,  J.  W.  Mauchly  (the  co-inventor  of  ENIAC  and  UNIVAC)  envisioned 
the  use  of  computers  for  documentation  and  library  science  activities.    He  suggested  that 
the  full  information  contents  of  the  Library  of  Congress  collections  could  be  recorded  in 
machine  language,  stored  in  this  form  on  magnetic  tape,  and  searched  by  machine  in  a 
procedure  which  would  match  words  or  other  selection  indicia  occurring  in  the  recorded 
information  to  the  specified  words  or  selection  criteria  of  a  query  or  search  prescription. 
Specifically, he estimated that the entire collection, then amounting to 10,000,000 books,
could when transcribed to binary-code representation 3/ be serially searched in 20
hours. 4/

1/ See, for example, Black, 1962 [65], p. 314: "The oldest book in the world has had
such an index for many years--the concordance to the Bible;" Markus, 1962 [394],
p. 19: "The ultimate in permutation for indexing is a published concordance;" Linder,
1960 [363], p. 99: "We know of a concordance prepared in the 13th Century;"
Simmons and McConlogue, 1962 [555], p. 3: "Complete indexing has been used of
course for centuries in the preparation of concordances."

2/ Carlson, 1963 [101], p. 211.

3/ That is, markings which have one of two values (thus, binary digits or "bits") can
be used to distinguish between $2^n$ different other symbols, such as alphabetic
characters, by using $\log_2 2^n = n$ of such markings. A binary code for the 26 letters of the
English alphabet requires a five-bit representation for each letter. If numeric digit
characters are also recorded (26+10), a six-bit code representation is required.

4/ Mauchly, 1949 [406], p. 295. See also "Report to the Secretary of Commerce on the
application of machines...", 1954 [620], p. 67.
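
The code lengths quoted in the note on binary-code representation follow from taking the smallest whole number of bits n for which 2^n is at least the number of symbols to be distinguished. The short sketch below checks the two cases given; the function name is an assumption made for illustration.

```python
def bits_per_symbol(n_symbols):
    """Smallest n such that 2**n >= n_symbols (fixed-length binary code)."""
    if n_symbols < 1:
        raise ValueError("need at least one symbol")
    # For n_symbols >= 1, (n_symbols - 1).bit_length() equals ceil(log2(n_symbols)).
    return (n_symbols - 1).bit_length()

print(bits_per_symbol(26))       # letters only -> 5
print(bits_per_symbol(26 + 10))  # letters and digits -> 6
```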
Mauchly's suggestion was, in effect, the idea of a complete index that could be
searched by machine. We should note, however, that although subsequent technological
advances could significantly decrease his original time estimate, the crucial questions
that remain are those of what, assuming one-to-one representation of document text, one
would search for. 1/ Natural language searching by machine, in the sense of full text
inspection, is a "pay-as-you-go" concordance technique. It is, however, a technique
which must be aided and abetted by various forms of synonym reduction, syntactic
normalization, homograph resolution and other special processing operations if it is to be
in any sense an effective tool for selection of clues to be retrieved.

Gardin, in a series of recent lectures on automatic documentation (Gardin, 1963
[207, 208]), 2/ refers to the opinions of some investigators that it should be possible to
"jump" the stage of indexing and to search the natural language texts directly. The
problem,  he  points  out,  then  shifts  to  the  determination  of  all  the  various  ways  in  which 
the  possible  answers  to  a  question  may  have  been  expressed  in  these  natural  language 
"complete  indexes".    Instead  of  carrying  out  reductions  or  condensations  of  the  documents, 
as  in  normal  indexing  procedures,  amplifications  of  questions  are  required.  "Reductive" 
indexing  of  the  source  documents  can  only  be  eliminated  at  the  expense  of  "expansive" 
indexing  of  questions.    Gardin  concludes  that  the  gain  from  this  is  very  doubtful. 
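
The shift of effort from documents to questions can be made concrete with a small sketch of a full-text search in which each question term is amplified into a set of alternative expressions. The synonym table, the sample texts, and the function names are hypothetical, and a practical system would require far more than a word-for-word synonym list.

```python
# Hypothetical table of alternative expressions for question terms.
SYNONYMS = {
    "machine": {"machine", "computer"},
    "indexing": {"indexing", "cataloging"},
}

def expand_query(terms, synonyms):
    """Replace each question term by the set of expressions accepted for it."""
    return [synonyms.get(term, {term}) for term in terms]

def matches(text, expanded):
    """A text matches when it contains at least one alternative for every term."""
    words = set(text.lower().split())
    return all(words & alternatives for alternatives in expanded)

documents = [
    "indexing by machine is feasible",
    "cataloging with a computer",
    "manual indexing by trained staff",
]
question = ["machine", "indexing"]
literal = [d for d in documents if matches(d, expand_query(question, {}))]
expanded = [d for d in documents if matches(d, expand_query(question, SYNONYMS))]
print(literal)
print(expanded)
```

The unreduced texts are searched as they stand, but only because the burden of anticipating variant wordings has been moved into the table applied to the question.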

There  is  also  the  presently  staggering  burden  of  time  and  cost  to  convert  full  texts  to 
machine-usable  form.    As  of  February,   1961,  it  was  estimated  that  the  natural  language 
text  material  available  for  machine  processing  amounted  to  little  more  than  the  words 
contained in the Harvard Classics five-foot shelf (Stevens, 1962 [567]). Perhaps up to
ten times that amount is now available, notably in the 6,000,000 words of the statutes of
Pennsylvania 3/ and in several million additional words that have since been keypunched
at the Center for Automation of Literature Analysis, Gallarate, Italy. 4/ A very recently


1/ See, for example, Yngve, 1959 [657], pp. 978-979: "We will have to find formal
connections  between  widely  divergent  ways  of  saying  essentially  the  same  thing.  In 
addition  there  is  much  that  we  will  have  to  learn  about  searching.    If  we  had  today  a 
complete  grammar  of  English  which  was  capable  of  rendering  explicit  all  the  relations 
and  distinctions  implicit  in  the  document,  I  doubt  that  we  would  know  how  to  use  it 
effectively  in  a  machine  search  situation.    We  would  be  embarrassed  by  the  very 
wealth  of  the  information  available.    Much  more  must  be  learned  about  search 
situations.  " 

2/ See also Bar-Hillel, 1962 [35], p. 415: "Could not the stage of clue assignment be
completely  skipped  and  the  request  topic  be  directly  compared  with  the  original 
documents?    It  is  very  natural  that  such  a  thought  should  have  arisen,  but  it  must 
be  stressed  that  there  is  nothing  in  our  knowledge  of  the  workings  of  communication 
which  would  indicate  that  such  a  proposal  is,  or  ever  will  be,  practical.  " 

3/ See various references by J. F. Horty, W. B. Eldridge and S. F. Dennis, E. M. Fels,
R. Wilson.

4/ R. Busa, data reported at the NATO Advanced Study Institute on Automatic
Document Analysis, Venice, July 1963.
completed study made by the TRW Computer Division, Thompson Ramo Wooldridge,
involves the investigation of the possibilities for a center to provide text in
machine-usable form. The report gives a total figure of approximately 50,000,000 words of text
so available as of February 28, 1964, but this includes non-scientific text, such as
newspaper and popular magazine materials (Mersel and Smith, 1964 [415]).

Mersel and Smith also report on the estimated requirements for machine-usable text
for various research groups, averaging over a million words per year per group. Yet, at
present keypunching costs of one cent or more per word, is it reasonable to assume that
any of these research groups can provide a budget of over $100,000 per year for this
purpose alone? Moreover, this budget would provide for the conversion of no more than
a thousand 1,000-word items or a hundred 10,000-word items at costs, respectively, of
$100 or $1,000 per item. For the present, therefore, the conclusion is inescapable: either
indexing or search based upon full text processing is not yet practical. Even the most
enthusiastic proponents of "searching full natural language text" (Swanson, 1960 [589])
and "maximum-depth indexing" (Simmons and McConlogue, 1962 [555]) generally agree as
to the present impracticality of full-text mechanized indexing except for special limited
cases.
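
The conversion-cost figures above can be related to one another by simple arithmetic. In the sketch below the keypunching rate is left as a parameter, since the per-item figures quoted ($100 per 1,000-word item, $1,000 per 10,000-word item) work out to ten cents per word, while a rate of exactly one cent per word would give totals one-tenth as large. The function names are assumptions made for illustration.

```python
def conversion_cost_dollars(words, cents_per_word):
    """Cost in dollars of keypunching `words` words at the given rate."""
    return words * cents_per_word / 100

def implied_cents_per_word(dollars_per_item, words_per_item):
    """Per-word rate, in cents, implied by a quoted per-item cost."""
    return dollars_per_item * 100 / words_per_item

annual_words = 1_000_000  # "over a million words per year per group"
for rate in (1, 10):
    print(rate, "cent(s) per word:", conversion_cost_dollars(annual_words, rate), "dollars")

print(implied_cents_per_word(100, 1_000))     # rate implied by $100 per 1,000-word item
print(implied_cents_per_word(1_000, 10_000))  # rate implied by $1,000 per 10,000-word item
```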

The two problems of determining what to search for, given full text, and of feasibility
of conversion of text into machine-usable form thus combine to limit "complete indexing"
largely to the special cases of providing corpora for studies in the field of computational
linguistics and of compiling the traditional scholarly tool--the concordance to all the words
in a given literary work or works. Apparent exceptions, including experimental work
with abstracts only and the law statutes studies, are usually cases in which the selective
principle of disregarding common words (and hence the bulk of the actual text) is applied
automatically either on input or in subsequent processing (Cleverdon and Mills, 1963
[131]). These cases, therefore, may be considered machine-generated indexes rather
than machine-compiled. Moreover, it should be noted that:

"The law, itself, is an appropriate field for data retrieval. The statutes,
especially, are written in relatively clear, concise language. At least, this
is their intent. Practically, this means that input and output can both be
relatively short and that retrieval of legal information will be involved with
fewer semantic difficulties." 1/

In the area of concordance-making, however, the potentialities of machine
compilation have been put to good use. The pioneer efforts in this area are unquestionably
those of Father Roberto Busa, S.J., of the Gallarate Center. As early as 1946, Busa
proposed to his superiors that a card file recording all the words used in all of the works
of St. Thomas Aquinas should be set up, and he began his actual experiments using IBM
punched card equipment in 1949 (Busa, 1953 [87], 1960 [91], and 1958 [92]; Secrest,
1958 [540]). 2/ Appearing in 1951, his Sancti Thomae Aquinatis Hymnorum Ritualium
Varia Specimina Concordantiarum is the first known example of a complete word index
that was compiled by machine techniques. The early Gallarate work was carried out on
standard punched card equipment, but from the time of the concordance to the Dead Sea
Scrolls, computers have also been used (Tasman, 1959 [595], [596], and [597]). The
major continuing task still concerns other works of St. Thomas. Other machine-compiled
concordances produced by Busa's Center include one to Goethe's Farbenlehre, Bd. 3.

1/ Asher and Kurfeerst, 1963 [24], pp. 1-2.

2/ See also Scheele (ed.), 1961 [522], pp. 206-209.
Other relatively well-known examples of machine-compiled concordances include
those to the Revised Standard Version of the Bible (Ellison, 1957 [186]; Cook, 1957 [139])
and to Matthew Arnold's poetry (Painter, 1960 [461]; Parrish [467, 468]). The Cornell
Concordance Series, under the general editorial supervision of Parrish, includes
investigations of Old English, such as The Anglo-Saxon Poetic Records (Bessinger, 1961
[59]).

The November 1962 issue of Current Research and Development in Scientific
Documentation, No. 11 [430], lists several concordances compiled by machine including
the work of Sebeok [533, 534] and associates at Indiana University on Cheremis folksongs,
the work on the National Vocabulary of the French language under Quemada at the
University of Besancon, 1/ the preparation of glossaries and concordances to the works of
Kant at the University of Bonn, 2/ and concordances to medieval German texts being
compiled by Wisbey at the University of Cambridge (Wisbey, 1962 [646], [647]). At the
University of Gothenburg in Sweden, work has begun on mechanical linguistic analysis of
English language texts, using the machine-readable teletypesetter tapes used for the
printing of paperback books (Ellegard, 1960 [184] and 1962 [185]). 3/ Another recent
example is that of the work at the Summer School of Linguistics, University of Mexico
(Grimes and Alvarez, 1961 [243]). By 1963, Marthaler writes that "Compiling
concordances with the aid of a computer is already standard routine to such an extent that
it needs hardly be described in detail." 4/ As of January 1964, a general-purpose
computer program for the IBM 7090 which can compile various types of concordances has
been announced as available from the Mechanolinguistics Project at the University of
California (1964 [95]). 5/

The  major  advantage  of  using  machines  to  compile  concordances  is,  of  course,  the 
enormous  difference  in  the  time  required  to  complete  the  work.     Thus,  only  120  hours 
were  required  on  the  UNIVAC  computer  to  prepare  the  800,000  words  of  the  Concordance 
to the Revised Standard Version of the Bible (Cook, 1957 [139]; Ellison, 1957 [186]). 6/


1/ See "Actes du colloque sur le mecanisation...", 1961 [1]; Quemada, 1961 [485] and
1959 [486]; Centre d'Etude du Vocabulaire Francaise, "Specimens de Travaux
lexicographiques...", 1960 [106].

2/ National Science Foundation's CR&D Report No. 11 [430], p. 316.

3/ Ibid., p. 321.

4/ Marthaler, 1963 [399], p.

5/ "California Concordance Program Available", 1964 [95].

6/ Carlson, 1963 [101], p. 211.
In the use of the IBM 705 for the concordance to the Summa Theologiae, Fr. Busa reports
that only 60 hours were required to arrange in alphabetical order 1,600,000 words. 1/
This advantage of speed, with the concomitant benefits of both economy and timeliness, is
illustrated by Tasman as follows:

"... It has been estimated that it would take 50 scholars 40 years... to manually
index the 13 million or so words of St. Thomas Aquinas' complete works. IBM
punched card machines would produce the indexes and concordances much more
accurately and would take ten scholars about four years. Large-scale data
processing techniques would reduce the time to about 25 percent... (or)... ten
scholars to do the job in less than a year." 2/

Other advantages stem from the facility with which further machine processing can be
introduced. Once the text is in machine-readable form, a number of valuable byproducts
can be derived. Examples are statistics on the number of words that have 2, 3, ... n
letters; frequencies of letter usage; printouts of occurrences of specified words or groups
of words; and lists alphabetized on terminal rather than initial letters. Added advantages
of computer processing are further exemplified in the options available with the California
concordance computer program (1964 [95]), some of which are as follows:

(1)    The  user  may  obtain  a  restricted  rather  than  a  full  concordance  by  supplying  a 
list  of  words  for  which  no  entries  are  to  be  made. 

(2)    The user may obtain a selective concordance by supplying a list of words for
which,  and  only  for  which,  entries  are  to  be  made. 

(3)  Each  entry  word  may  be  centered  with  its  preceding  and  succeeding  context, 
up  to  the  limits  of  one  full  line  of  131  characters,  or  each  entry  word  may  be 
listed  together  with  the  full  sentence  or  verse  in  which  it  occurs. 

(4)  Text  with  interlinear  information  such  as  grammatical  symbols  can  be  used  and 
selective  concordances  can  be  compiled  on  the  basis  of  such  interlinear 
information. 

(5)  The  citations  of  an  entry  can  be  listed  in  order  of  textual  occurrence,  in  an 
order  determined  by  preceding  or  following  words  in  its  context  or  in  an  order 
determined  by  accompanying  interlinear  symbols. 
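
Options (1) through (3) can be illustrated with a minimal sketch. This is not the California program itself, whose internals are not described here; the function name `concordance` and its parameters `stop_words`, `only_words` and `width` are assumptions introduced for illustration only.

```python
import re

def concordance(text, stop_words=None, only_words=None, width=30):
    """Return (entry_word, context_line) pairs, alphabetized by entry word
    and, within a word, listed in order of textual occurrence."""
    text = " ".join(text.split())            # collapse line breaks and runs of spaces
    stop = {w.lower() for w in (stop_words or [])}
    only = {w.lower() for w in (only_words or [])}
    entries = []
    for match in re.finditer(r"[A-Za-z']+", text):
        word = match.group().lower()
        if word in stop:                      # option (1): restricted concordance
            continue
        if only and word not in only:         # option (2): selective concordance
            continue
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # option (3): entry word centered between preceding and succeeding context
        line = f"{left:>{width}}{match.group()}{right:<{width}}"
        entries.append((word, match.start(), line))
    entries.sort(key=lambda e: (e[0], e[1]))
    return [(word, line) for word, _, line in entries]
```

Calling `concordance(text, only_words=["earth"])` would list only the occurrences of "earth", each centered in a fixed-width line. Options (4) and (5), which depend on interlinear symbols and alternative orderings, are not covered by this sketch.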

2.2   Card  Catalogs,  Book  Catalogs,  Bibliographies  and  Subject  Index  Listings 
Prepared  by  Machine 

The use of machines such as punched card equipment for the preparation and processing
of library card catalogs and of index listings was advocated by a few far-sighted
documentalists at least as early as the 1930's (Parker, 1938 [463]; Dewey, 1959 [153]).


1/ 

See his statement in Scheele, 1961 [522], p. 209.

2/

Tasman, 1958 [596], p. 11.

McCormick's bibliography on mechanized library processes (1963 [407]) lists a number
of early suggestions, notably those of Fair in 1936 [187], Shera in 1938 [547], and Gates
[225] and Callander [96, 97, 98] in 1946. Cox, Bailey and Casey proposed the use of
punched card equipment for the preparation of bibliographies in the field of chemistry in
1945 [142].

By  1946,  Gull  claimed  that: 

".  .  .Punched  cards  and  present  equipment  offer  new  possibilities  right  now  for 
solving  the  problems  of  the  indexes  to  Chemical  Abstracts.    These  indexes  are 
large  undertakings  in  themselves,  and  the  work  of  arranging,   cumulating,  and 
printing  them  can  be  simplified  by  placing  the  index  information  on  punched 
cards  at  the  time  the  abstracts  are  made.    With  current  indexes  on  punched 
cards,  two  or  three  cumulations  of  the  author  index  during  the  year  will  greatly 
reduce  the  work  required  in  using  current  issues  from  that  approach.  Cumu- 
lations of  the  subject,  patent,   and  formula  indexes  immediately  become  possible 
for  intervals  more  frequent  than  once  a  year."  [245] 

The  following  year  (1947)  saw  a  summary  by  Gull  of  potential  applications  of  punched 
cards  in  special  libraries  [247],  and  Becker  surveyed  some  of  the  then  discernible 
prospects  for  library  mechanization,  as  a  student  in  the  Library  School  of  Catholic 
University.    He  stressed  such  advantages  as  flexibility  in  the  processing  of  new  material 
for  abstracting,  indexing,  filing,  and  interfiling  purposes  and  the  printing  out  of  various 
listings in any format. 1/

The potential use of machines for library science and documentation was not actually
recognized, however, until many years after the invention of punched card equipment.
Both the punched card developments (beginning with Hollerith and Powers in the 1880's)
and the electronic computers developed from 1946 onward were first applied to the automatic
manipulation of information in the sense of statistical, mathematical, or engineering
data, rather than to information about data or information about other information.
Dr. John Shaw Billings, himself a librarian of note, was apparently the first to suggest
to Herman Hollerith the idea of recording information as holes punched in cards which
could then be sorted mechanically. 2/ Larkey comments: "It is not known if Billings ever
thought of applying the principle to bibliographic work, but it would seem eminently
fitting that it might be so utilized." 3/

Larkey  himself  as  head  of  the  Army  Medical  Library  Research  Project  at  the  Welch 
Medical  Library,  Johns  Hopkins  University,  was  certainly  one  of  the  pioneers  in  such 
utilization, but this was almost 70 years from the date of the Billings-Hollerith
conversations.     The  Army  Project,  begun  in  late  1948  or  early  1949,  had  as  its  contract 
Becker, 1947 [43], pp. 11-12: "From the flexible arrangement of the cards,
bibliographies become readily available by subject, author, and title. In special
libraries, where material on one subject is concentrated, the research possibilities
of gathering, sorting, filing, and printing information are almost limitless. Continuous
machine interfiling permits keeping current with new entry additions."

"With the masters...", 1963 [648], p. 18.

Larkey, 1953 [351], p. 34.

objective "to explore existing and projected methods, emphasizing machine methods,
applicable to such pilot projects as may be necessary" (Larkey, 1949 [348], 1956 [349],
and 1953 [351]). Also as of 1949, the Library of the Department of Agriculture is
reported to have "conducted an experiment in the use of electronic data-processing
machines to produce the author and subject indexes to the 'Bibliography of
Agriculture'." 1/

It was not until the early 1950's, however, that punched card machine techniques were
actively put to use for the preparation of card catalogs, book catalogs, bibliographies and
various index listings. Then, a number of independent but largely concurrent applications
were tried out on at least an experimental basis. These included, in addition to the work of
the Welch Medical Library Project, pioneering efforts in mechanized book catalog production
(Griffin, 1960 [242]; Martin, 1953 [400]; Berry, 1958 [58]) and what is claimed to be the
"first successful non-experimental punched-card catalog of periodicals", the Serial Titles
Newly Received (now New Serial Titles), as published by the Library of Congress from
1951 onwards. 2/

The  work  at  the  Welch  Medical  Library  continued  for  several  years,  the  final  report 
being  issued  in  1955  [234].    Beginning  in  1951,  the  project  maintained  in  punched  card 
form  the  subject  heading  authority  list  used  for  the  Current  List  of  Medical  Literature 
(Larkey, 1953 [351]; Garfield, 1953 [217] and 1954 [220]). Garfield has stated that this
work "clearly demonstrated the ease of converting alphabetic subject heading lists to
categorized or classified lists of terms by the use of punched card equipment." 3/ That is,
each heading or subheading had assigned to it a numeric code reflecting its appropriate
position in the classified system, which could then be used by machine for sorting,
ordering and listing. Ingenious use was made of the IBM 101 Statistical Machine in the
preparation of printed subject indexes (Garfield, 1953 [218] and 1954 [216]). Other
subject heading lists maintained by punched card techniques by 1953 or earlier included
those of the U. S. Patent Office and the Technical Information Division of the Library of
Congress.
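
The conversion from an alphabetic to a classified listing by means of numeric codes can be sketched as follows. The headings, the dotted code scheme, and the function name `classified_listing` are hypothetical; the actual Welch Project codes are not given in this report.

```python
def classified_listing(code_by_heading):
    """Order subject headings by their numeric class codes instead of alphabetically.

    Codes are assumed to be dotted strings such as "2.1.1"; each part is
    compared as a number so that "1.10" files after "1.2".
    """
    def sort_key(item):
        _, code = item
        return tuple(int(part) for part in code.split("."))

    ordered = sorted(code_by_heading.items(), key=sort_key)
    return [f"{code}  {heading}" for heading, code in ordered]


# Hypothetical authority list: heading -> position in the classified system.
headings = {
    "Anatomy": "1",
    "Heart": "1.2",
    "Bones": "1.1",
    "Pharmacology": "2",
    "Antibiotics": "2.1",
    "Penicillin": "2.1.1",
}

for line in classified_listing(headings):
    print(line)
```

Once each heading carries its code, the same file can be listed alphabetically or in classified order simply by changing the sort key, which is the ease of conversion that Garfield describes.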

The  first  loose-leaf  printed  book  catalog  to  be  produced  by  machine  methods  was 
apparently  that  of  the  King  County  Public  Library  in  the  State  of  Washington  in  1951,  and 
the  following  year  the  Los  Angeles  County  Library  inaugurated  a  similar  system  for  the 
distribution  of  a  master  book  catalog  prepared  by  mechanized  techniques  (Berry,  1958 
[58]; Griffin, 1960 [242]; Martin, 1953 [400]; Alvord, 1952 [4]).

The  work  on  mechanized  preparation  of  lists  of  periodicals  at  the  Library  of 
Congress has been reported as follows:

"In 1951, the Library began publishing, at monthly intervals, Serial Titles
Newly Received. In 1953, its title was changed to New Serial Titles...
Ever  since  its  inception,  the  fundamental  ingredient  of  the  publication  has 
been  the  IBM  punched  card.  .  . 

U.S. Congress, Senate Committee on Government Operations, 1960 [619], p. 147.
Dewey, 1959 [153], p. 36.

3/

Garfield, 1959 [221], p. 471.

4/

Garfield, 1954 [220], p. 1.

"Two important advantages of the punched-card method were foreseen when the
publication began. First, it would be possible to print lists from the cards at will,
without any further editing or proofreading, once the information was in punched-card
form. Second, there was the possibility of mechanically preparing special lists of
titles, selected on the basis of subject, country, or language." 1/

Thus, by 1953, "a number of instances of printed indexes prepared by machine" could
be claimed. 2/ The use of punched cards to sort, to prepare tabular listings for various
drafts and revisions, and to interfile corrected or revised entries greatly facilitated the
preparation at Battelle Memorial Institute of the subject index to the Proceedings of the
International Conference on the Peaceful Uses of Atomic Energy, 1955 (Lipetz, 1960
[367]).

Developments in the use of punched card machine techniques in bibliographic operations
of these types, beginning in the 1950's, have by no means been limited to the United
States. For example, Remington Rand punched cards have been used in the preparation of
a national union catalog of Italian libraries, 3/ and Mikhailov reports for the All-Union
Institute of Scientific and Technical Information (VINITI) as follows:

"The development program for machine production of indexes has been underway
at the Institute for a number of years...In fact, operational use of Soviet-made
punch-card machines to compile the author indexes for some of the series of our
Abstract Journal has been practiced at the Institute since 1957." 4/

In  France,  at  the  Centre  d'Etudes  Nucleaires,  Saclay,  a  program  has  been  developed 
for  mechanization  of  the  production  of  biweekly  and  cumulative  indexes  and  for  demand 
searches (Chonez, 1960 [116, 117, 118]).

With the advent of automatic data processing systems, the speed, the flexibility and
the capability for multiple-purpose processing buttress the claim that the card catalog can
be "replaced or supplemented by book catalogs made with the aid of mechanized equipment". 5/
It is further claimed that "The printed catalog produced by means of automatic
equipment combines the best features of the conventional card catalog and the traditional
printed catalog, and adds to both new dimensions that would have been unbelievable a
generation ago." 6/ A joint project is under way by the Medical Libraries of Columbia,
U. S. Congress, Senate Committee on Government Operations, 1960 [619], p. 85.

2/

Larkey, 1953 [351], p. 38.
Berry, 1958 [58], p. 287.
Mikhailov, 1962 [410], p. 50.
McCormick, 1963 [408], p. 195.

Vertanes, 1961 [625], p. 242. This is with reference to the LILCO Library Printed
Catalog, which is prepared by sorting and processing information on titles, authors
and titles-by-subject-groupings serving as indexes to the holdings at the Long Island
Lighting Company.

Harvard, and Yale Universities for computer preparation of book catalogs for books
published from 1960 onward (Kilgour et al, 1963 [324]). Another recent illustrative
example of the production of printed book catalogs by means of computer compilation is
that of the Boeing "SLIP" System (Weinstein and Spry, 1963 [633]).

Along with recognition of computer-processing potentialities there has emerged
increased awareness of the desirability of taking advantage of one-time recording of
information to serve multiple purposes: the principle of by-product data generation. The
advantages for the library and document collection are that a single recording of bibliographic
information in machine-usable form can lead to a variety of products, specifically
including printed book catalogs, 1/ recurrent and demand bibliographies, the requisite
number of copies for conventional card catalogs, card catalog sets or catalog listings for
the personal use of the individual worker, input to mechanized selection and retrieval
systems, and machine-manipulatable data for such other purposes as circulation control.
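
The principle of by-product data generation can be sketched in a few lines: one set of records, keyed once, feeds several independent products. The record layout, the sample entries, and the three function names below are hypothetical and are not drawn from any of the systems cited.

```python
from collections import defaultdict

# Hypothetical bibliographic records, recorded once in machine-usable form.
RECORDS = [
    {"id": 101, "author": "Brown, A.", "title": "Punched Cards in Libraries",
     "subjects": ["mechanization", "catalogs"]},
    {"id": 102, "author": "Adams, C.", "title": "Computer-Produced Indexes",
     "subjects": ["indexes", "mechanization"]},
]

def author_listing(records):
    """Product 1: a book-catalog style listing in author order."""
    return [f"{r['author']}  {r['title']}  ({r['id']})"
            for r in sorted(records, key=lambda r: r["author"])]

def subject_index(records):
    """Product 2: a subject index mapping each heading to item numbers."""
    index = defaultdict(list)
    for r in records:
        for subject in r["subjects"]:
            index[subject].append(r["id"])
    return {s: sorted(ids) for s, ids in sorted(index.items())}

def catalog_card(record):
    """Product 3: the text of a conventional catalog card for one item."""
    tracings = "; ".join(record["subjects"])
    return f"{record['author']}\n    {record['title']}\n    Subjects: {tracings}"
```

No product requires rekeying: each is a different arrangement or selection of the same recorded fields, which is the economy claimed for the approach.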

Turner and Kennedy report, for example, the initial use of a Flexowriter to prepare
library catalog cards and the by-product generation, via a 1401 computer, of bi-weekly
listings of unclassified report titles at the Lawrence Radiation Laboratory, the "SAPIR"
System (Turner and Kennedy, 1961 [615]). Chasen discusses a change from a previous
punched card system for circulation and recall at General Electric's Missile and Space
Division Laboratory to a combined Flexowriter and G.E. 225 computer procedure to
provide mechanized retrieval, compilation of desk catalogs, computer updating of
catalogs and files, and the maintenance of subscription lists (Chasen, 1963 [108]).

Fasana  describes  a  system  at  the  Air  Force  Cambridge  Research  Laboratory  Library 
where  typing  indications  in  the  tape  are  used  as  boundary  codes.    He  reports: 

"Input  tapes  are  currently  being  processed  on  a  computer  to  automatically  produce 
catalog  card  sets,  circulation  control  records,  and  book  form  indexes.  Original 
input tapes now being accumulated will form the basis of a machine-searchable
file to be used in the future for more sophisticated printouts and searches." 2/

For  such  applications,   Durkin  and  White  make  the  following  typical  claims: 

"The  system  described  has  permitted  the  IBM  Command  Control  Center  Engineering 
Library  to  produce  its  catalog  cards  and  library  bulletin  both  faster  and  cheaper. 
Since  a  by-product  of  this  process  is  the  preparation  of  all  catalog  information  in 


1/ 

See for example, Olney, 1963 [458], p. 42: "During the past few years a number
of libraries have initiated a program of mechanization...by punching on IBM cards
or paper tape some of the bibliographic information normally given on catalog cards.
Recording this information in machine-readable form makes it very easy to prepare
printed book catalogs..."

2/

Fasana, 1963 [195], p. 326. This system involves the "Machine-Interpretable
Natural Format" and procedures developed for AFCRL by Itek Corporation;
see also Lipetz et al, 1962 [368].

punched card form, it has also permitted the establishment of a circulation control
system, the publication of overdue notices and reading lists, and the eventual
institution of a computer retrieval program" (Durkin and White, 1961 [173]; White,
1963 [638]).

Heiliger reports for the library of the new Chicago Campus of the University of
Illinois as follows:

"The type of bibliography the computer can produce does make greater use of LC
card information than do present card catalogs. With the computer programmed
with a set of library filing rules and a set of symbols that describes for the computer
the various parts of the bibliographic unit, it can print out, for instance, a list of
books published in a given country, between certain years, on a certain subject (or
combination of subjects), that are illustrated and have bibliographies. It will also
be possible to permute on individual items in LC subject headings in the same fashion
that Chemical Titles does on titles. This index has been dubbed POSH (permuted on
subject headings)." 1/
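
One plausible reading of "permuting on individual items" in a subject heading is to rotate the heading so that each element in turn becomes the filing element. The sketch below takes that reading as an assumption; the " -- " separator, the sample heading, and the function names `permute_heading` and `posh_index` are hypothetical and do not describe the Illinois program.

```python
def permute_heading(heading, sep=" -- "):
    """Return every rotation of a subdivided heading, one per element."""
    parts = heading.split(sep)
    return [sep.join(parts[i:] + parts[:i]) for i in range(len(parts))]

def posh_index(headings_by_book):
    """Build sorted (rotated heading, book) entries for all assigned headings."""
    entries = []
    for book, headings in headings_by_book.items():
        for heading in headings:
            for rotated in permute_heading(heading):
                entries.append((rotated, book))
    return sorted(entries)


for entry in posh_index({"B1": ["Libraries -- Automation -- United States"]}):
    print(entry)
```

A heading with three elements thus yields three index entries, so that a searcher who looks under "Automation" or "United States" finds the book as readily as one who looks under "Libraries".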

Some recent experimental work at Inforonics, Inc. puts major emphasis on by-product
data generation, beginning with the actual preparation of manuscripts for publication.
Tape typewriter processing of manuscript for journal articles is being studied
from the point of view of producing machine-usable text. This text, together with coded
identification of the separate items in the text, is so prepared that computer programs
can produce, from the single input, automatic typesetting tapes for the article itself,
author and subject index entries, and the like. Computer text transformations can also
produce entries for citation indexes, abstract journals and search files (Buckland, 1963
[83, 84]).

Other computer-produced indexes or special indexes involving compilation rather
than selection by machine include indexes to Nuclear Science Abstracts (Day and Lebow,
1960 [151]), the Current List of Medical Literature (Chonez, 1960 [116, 117, 118]),
the Retrieval Guide to Thermophysical Properties Research Literature, 2/ and the
Research and Development Abstracts of the USAEC (Sherrod, 1963 [541]). At the
Atomic Energy Commission also, a modification of this RDA computer program is used
for author, corporate author, number and subject indexes for the Engineering Materials
List, which includes announcements of blueprints and drawings. 3/ In several instances,
machine processing capabilities are used for permuted listings under various assigned
indexing terms. 4/ Special cases of machine permutation operations involve compilation
and organization of chain indexes, used to reflect the various key entries in faceted
classification systems (Dowell and Marshall, 1962 [159]; Foskett, 1962 [199]; Olney,
1963 [458]).

1/

Heiliger, 1962 [259], p. 475.

Markus, 1962 [394], p. 19; Touloukian, 1962, 1963 [607].
Davis, 1963 [150], p. 237.

See, for example, reports on the SWIFT program for NASA's STAR (Newbaker and
Savage, 1963 [438]); the AIMS System (Heller, 1963 [260]); and the SPINSTRE
System (Wheater, 1963 [639]).

A final special case of a computer-compiled index should be noted. This is the work
of Schultz and Shepherd with reference to the annual meetings of the Federation of American
Societies for Experimental Biology (FASEB) (Schultz and Shepherd, 1960 [532]; Schultz,
1963 [527]; Shepherd, 1963 [545]). 1/ The indexing terms are generated first by the authors
of the papers but are then run against a computer program, which by thesaurus-type look-up
eliminates synonyms and supplies syndetic devices in addition to formatting the subject
index for printout.

The  machine- readable  thesaurus  developed  for  this  project  presently  performs  the 
following  four  basic  functions  (Schultz,   1963  [527]): 

1.  It  accepts  words  from  titles  and  indicia  supplied  by  the  authors  without 
modification  if  they  match  acceptable  indexing  terms. 

2.  It  recognizes  certain  other  words  as  acceptable  if  modified  and  modifies  them 
accordingly, for example, by "use" directions for synonyms and near-synonyms.

3.  It  adds  additional  indexing  terms  when  certain  words  occur,  an  example  being 
"  'penicillin',  use  also  'antibiotics'." 

4.  It  deletes  certain  words  if  they  do  not  occur  in  the  context  of  an  acceptable 
indexing  phrase. 
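
The four functions listed above can be sketched as a small table look-up. The vocabulary tables (`ACCEPTED`, `USE`, `USE_ALSO`) and the function name `index_terms` are hypothetical; only the "penicillin, use also antibiotics" rule comes from the text, and the FASEB thesaurus itself is not reproduced here.

```python
# Hypothetical thesaurus tables, for illustration only.
ACCEPTED = {"antibiotics", "penicillin", "heart", "blood pressure"}
USE = {"cardiac": "heart"}                    # synonym -> preferred term
USE_ALSO = {"penicillin": ["antibiotics"]}    # term -> additional terms

def index_terms(words):
    """Apply the four thesaurus functions to author-supplied words."""
    words = [w.lower() for w in words]
    terms = []
    i = 0
    while i < len(words):
        pair = " ".join(words[i:i + 2])
        if pair in ACCEPTED:          # accepted two-word indexing phrase
            terms.append(pair)
            i += 2
            continue
        word = words[i]
        if word in ACCEPTED:          # function 1: accept without modification
            terms.append(word)
        elif word in USE:             # function 2: modify by a "use" direction
            terms.append(USE[word])
        # function 4: any other word, including one that is acceptable only
        # inside a phrase (such as "pressure" alone), is deleted
        i += 1
    for term in list(terms):          # function 3: add "use also" terms
        terms.extend(USE_ALSO.get(term, []))
    return sorted(set(terms))


print(index_terms("Effect of penicillin on cardiac blood pressure".split()))
```

In this sketch "pressure" survives only as part of "blood pressure"; standing alone it would be dropped, which is the behavior described under function 4.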

2.3   Tabledex and Other Special Purpose Indexes

The  uses  of  machine  techniques  in  index  compilation  so  far  discussed  represent 
instances  in  which  conventional  tools  of  bibliographic  control  can  be  prepared  at  lower 
cost  or  more  rapidly,  or  both.    In  addition,  however,  certain  new  and  unconventional 
types  of  index  have  been  or  are  being  produced  with  the  aid  of  computers. 

The Tabledex method, as proposed by Ledley in 1958 (Ledley, 1958 [352]; Zusman
et al, 1962 [661]; O'Connor, 1960 [442]), involves coordinate indexing in bound book
form, with special features to facilitate search, conserve space and display index terms
co-occurring with a given term for a given item. 2/ A major advantage claimed for this
method is that by the use of computers bibliographies and book-form indexes can be
organized, compiled, and printed in page format within a matter of hours.

A  Tabledex  index  typically  consists  of  a  bibliography  proper,  in  which  each  citation 
has  been  assigned  an  identifying  number;  an  alphabetical  list  of  the  indexing  terms  used, 


These  investigators  claim  the  first  production    of  a  conventional  subject  index  by 
computer. 

See, for example, O'Connor, 1960 [446], p. 241: "Ledley approximately halves the
average size of the document descriptions required by imposing an order on the
vocabulary of indexing terms. When a document description belongs in a term subset,
only those terms of the description need to be recorded which come later in term
order than the term of the term of the subset. This illustrates another type of
storage organization."

which may also have numeric codes; and a set of indexing tables. These tables contain
item numbers in the leftmost column, and either the names or the codes for indexing
terms assigned to an item along the row. There is one such table for each distinct term
used in indexing the items.

To  facilitate  searching,   only  those  terms  which  are  of  higher  numeric  or  alphabetic 
order  than  that  for  the  term  for  which  the  particular  table  is  compiled  are  recorded  in  the 
rows.     Thus  to  make  a  search  on  several  terms,  the  user  turns  to  the  table  for  the  one  of 
these  terms  that  has  the  lowest  term  value,  which  table  records  all  items  to  which  the 
term  has  been  assigned,  and  checks  the  rows  of  the  table  for  the  second  lowest  ranking 
term,  the  third,  and  so  on.    Variations  in  the  Tabledex  method  allow  for  the  automatic 
assignment  of  numeric  codes  to  the  indexing  terms  based  on  relative  frequency  of  use 
within  the  collection.  Ledley  also  discusses  methods  for  finding  articles  associated  with 
all  except  one,  all  except  two,  or  all  except  n  of  the  given  words  in  a  search 
prescription. 1/
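
The table organization and the search procedure just described can be sketched as follows, assuming plain alphabetic term order. The item numbers, the sample terms, and the function names `build_tables` and `search` are hypothetical; frequency-based codes and the "all except n" searches are not covered.

```python
def build_tables(items):
    """Build one table per term: item number -> terms of higher order only."""
    tables = {}
    for item_no, terms in items.items():
        for term in terms:
            higher = sorted(t for t in terms if t > term)
            tables.setdefault(term, {})[item_no] = higher
    return tables

def search(tables, query):
    """Find items indexed under every query term.

    Turn to the table for the lowest-ordered query term, then check each
    row of that table for the remaining (higher-ordered) terms.
    """
    ordered = sorted(set(query))
    lowest, rest = ordered[0], ordered[1:]
    rows = tables.get(lowest, {})
    return sorted(n for n, higher in rows.items()
                  if all(t in higher for t in rest))


# Hypothetical collection: item number -> assigned indexing terms.
items = {
    1: {"aurora", "ionosphere", "radio"},
    2: {"aurora", "radio"},
    3: {"ionosphere", "radio"},
    4: {"aurora"},
}
tables = build_tables(items)
print(tables["aurora"])
print(search(tables, ["radio", "aurora"]))
```

Because each row omits terms of lower order than its table's own term, every term pair is recorded only once, in the table of the lower term; this is the source of the space saving noted by O'Connor.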

A first example of a computer-compiled Tabledex index was that to a bibliography
prepared by the Library of Congress for the International Geophysical Year (Zusman
et al, 1962 [661]). 2/ The computer program for the IBM 7090 carried out the operations
of assigning accession numbers, extracting index terms and compiling the term lists,
determining frequencies so as to assign frequency numbers to the terms, organizing and
preparing the tables, and developing an author index. Two formats were used, one giving
terms by numeric code and the other spelling out the terms as normal words. The latter
feature provides a measure of browsability in the system. 3/ A Tabledex compilation
program is also in use at the Applied Physics Laboratory of Johns Hopkins University
(Olmer and Rich, 1963 [454]).

Another coordinate index search tool, making use of what is in effect a document-descriptor
matrix with special codes and column arrangements to save space and
facilitate rapid scanning, is the Scan-Column Index suggested in 1960 by O'Connor [449].
He further suggested the use of computers for compilation, as follows:

"A  computer  can  organize  information  about  documents  into  a  scan-column  index. 
The  input  needed  consists  of  the  document  identifications  and  their  accompanying 

Ledley, 1959 [352], pp. 1235-1239.

See also National Science Foundation CR&D No. 11 [430], pp. 130-131.

3/

Zusman et al, 1962 [661], p. ii: "...The word tables have the advantage that
browsing can be accomplished and possible associations made during the search...
Such 'browsing' can be enhanced by including at the end of each row in a table all
the other words also associated with the article of that row".

index terms...and an indication of either the number of columns desired or the
column density desired. The computer will determine the frequency of each
term, the positive and negative correlations of terms, and the quantity of these
correlations by counting or sampling key figures, such as the average number
of terms per document. It then can assign column-character codes accordingly." 1/

In 1961, Costello described the use of computer techniques for compilation and
computer printout of a dual dictionary for a coordinate indexing system using links and
roles at DuPont's Polychemicals Department. After manual analysis, term-role assignments
are keypunched, the cards are listed for editing, including the elimination of
synonyms and the indication of appropriate postings to more generic terms, and re-keypunched
for conversion to magnetic tape. Tapes for posting of items and links to
term-roles are merged by computer with tapes giving alphabetical equivalents of term
codes and with appropriate syndetic indications for final output on an IBM 407 high-speed
printer [141].

Still another instance of a coordinate index, modified to show pre-coordination of
terms as compiled by computer, is that of the Electronic Properties Information Center
(Johnson, 1963 [301]). The system consists of abstract cards maintained in accession
number order, together with machine printouts that pre-coordinate descriptors within
nine major categories. The listings of pre-coordinated descriptors are arranged in
three different indexes: alphabetically arranged within each category, alphabetized without
respect to category but with code indication of the category reference, and a non-categorized
listing arranged alphabetically in reverse order. Advantages of machine
processing include the ease with which various statistical counts can be made, such as
the average number of items in the system for a given material and a specified property.
Summary indications of the state-of-the-art in the field of interest can be obtained, "for
the system will indicate not only areas where research has been done, but also areas
where gaps in the literature occur, and a measure of the growth of research activities
in the field can be developed." 2/

2.4   Citation Indexes

"A citation index is a directory of cited references in which each reference is
accompanied by a list of source documents which cite it." 3/ This is a relatively new

1/

O'Connor, 1962 [449], pp. 18-49.
Johnson, 1963 [301], p. 296.

3/

Sher and Garfield, 1963 [546], p. 63.

type of bibliographic search tool that would be almost impossible to compile without the
use of machines. 1/ In at least one case, moreover, the availability of mechanical
devices was itself the inspiration for the idea of a citation index to the scientific literature.
Garfield states in a 1954 paper that he was led to the idea of "Shepardizing" from
an earlier concern with the development of citation codes or "coden" 2/ that would
facilitate machine processing of bibliographic and index entries. 3/
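
The definition quoted at the head of this section amounts to inverting the reference lists of the source documents. A minimal sketch follows; the document labels and the function name `citation_index` are hypothetical.

```python
from collections import defaultdict

def citation_index(references_by_source):
    """Invert source -> cited references into cited reference -> citing sources."""
    index = defaultdict(list)
    for source, references in references_by_source.items():
        for reference in references:
            index[reference].append(source)
    # Cited references in order, each with its citing documents in order.
    return {ref: sorted(sources) for ref, sources in sorted(index.items())}


# Hypothetical source documents and the references each one cites.
sources = {
    "Paper C": ["Paper A", "Paper B"],
    "Paper D": ["Paper A"],
}
print(citation_index(sources))
```

The inversion itself is trivial; the difficulty noted in the text lies in the volume of references to be keyed, sorted, and merged, which is why machine processing is regarded as a practical necessity.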

The value of Shepard's Citations in tracking down precedents and decisions has been
recognized in the legal field for many years. 4/ The desirability of a similar tool for
literature searchers in the fields of scientific and technical information was suggested
about a decade and a half ago, when Seidell and others proposed its use for patent
searching (Seidell, 1949 [541]; Hart, 1949 [255]). In 1954, the Bush Committee in its
considerations of the potential applicability of machines to Patent Office problems
received a proposal from the Atlantic Research Corporation of Alexandria, Virginia,
which was to cover "the development of a Patent Citation Index, comparable to Shepard's
Citations". 5/

In the period 1954-1956, both Garfield 6/ and Fano 7/ independently advocated
the development of a citation indexing tool for scientific and technical literature. As

1/ 

See, for example, Atherton, 1962 [25], p. 4: "The volume of data to be processed
is so massive that processing machines are a necessity"; Garfield, 1954 [210], p. 4:
"Where such large volume of data is to be handled it must be expected that
mechanical devices of high speed and versatility...would probably be a determining
factor in the system's success."

That is, brief codes, often mnemonic, for journal title abbreviations and other
clues to publisher and date of publication.

Garfield, 1954 [210], p. 2.

4/ 

How  to  Use  Shepard's  Citations  [28l]  has  been  published  periodically  by  Shepard's 
Citations,  Inc.  ,  Colorado  Springs,   since  1873. 

5/

U. S. Dept. of Commerce "Report to the Secretary of Commerce...," 1954 [620],
p. 27.

6/

Garfield [210, 211, 212]. Adair, writing in January, 1955, specifically acknowledges
a suggestion of Garfield's (for 1955 [2], p. 32) but Garfield in turn credits
Adair (1963 [214], p. 290).

7/ 

Fano, 1956 [191], p. 3: "Let us accept, at least for the sake of this argument, the
conclusion that linguistic associations between documents cannot lead to a satisfactory
definition of a bibliography. Then the only other type of association for
which evidence is available is that provided by simultaneous references in the
literature, by the concomitant use of documents by experts as evidenced by library
records, and by other similar joint events."




of today, there are at least five or six instances of citation indexes that have been produced,
several different experimental investigations are under way, and new interest
has been generated by the considerations of the Weinberg Panel. Thus:

"Of the newer approaches to the indexing of scientific documents, the Weinberg
Panel was particularly impressed with the citation index as a promising bibliography
tool. In order to learn more about this approach, the National Science
Foundation is currently sponsoring the compilation and publication of extensive
citation indexes for the fields of genetics and also for statistics and probability;
and is supporting two kinds of experiments to evaluate different techniques for
using citation data in indexes and searching systems in the field of physics." 1/

In general, the principle of citation indexing is based upon the hypothesis that the
bibliographic references cited by an author provide significant clues to the subject content
of the author's own paper and/or that there is a certain commonality in subject between
papers that cite the same references or that are co-cited. 2/ The principle can be applied
to the compilation of bibliographical or indexing tools in several different ways. First,
there is the method of citedness, which groups for a given item the identifications of subsequent
items that have cited it. The converse of this is, of course, the bibliography or
reference list of a given item. 3/ In the first case, we are concerned with "descendants,"
and in the list of references with "ancestors". 4/
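
The relation between the reference list ("ancestors") and the citedness listing ("descendants") can be illustrated by a short sketch. The following Python fragment is an illustration only and does not describe any of the systems surveyed here; the function name build_citation_index and the item identifiers P1 through P5 are assumptions introduced for the example. It inverts a set of reference lists into a citation index.

```python
from collections import defaultdict


def build_citation_index(reference_lists):
    """Invert {citing item: [cited items]} into {cited item: [citing items]}.

    The input corresponds to the reference lists ("ancestors") of each item;
    the output corresponds to a citedness listing ("descendants").
    """
    index = defaultdict(set)
    for citing, references in reference_lists.items():
        for cited in references:
            index[cited].add(citing)
    # Sort the citing items so that the listing is reproducible.
    return {cited: sorted(citers) for cited, citers in index.items()}


# Hypothetical reference lists for three citing papers.
reference_lists = {
    "P3": ["P1", "P2"],
    "P4": ["P1"],
    "P5": ["P3", "P1"],
}

citation_index = build_citation_index(reference_lists)
for cited in sorted(citation_index):
    print(cited, "is cited by", citation_index[cited])
```

Items that are never cited (P4 and P5 in this example) have no entry in the resulting index, which corresponds to one of the coverage limitations discussed later in this section.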


1/ 

Committee  on  Scientific  Information,  1963,  [135],  p.  16. 

2/

Compare Adair, 1955 [2], p. 32, with respect to Shepard's Citations itself:
"Since all of the cases listed under a given case have cited it, it follows that
they must all be, more or less, pertinent to the case cited." See also Kessler,
1963 [320], p. 1: "This method ... originated in the hypothesis that the bibliography
of technical papers is one way by which the author can indicate the
intellectual environment within which he operates, and if two papers show similar
bibliographies there is an implied relation between them."

3/

See Salton, 1962 [520], p. III-3: "A citation index consists of a set of bibliographic
references (the set of 'cited' documents), each being followed by a
list of all those documents (the 'citing' documents) which include the given
cited document as a reference. A citation index is to be distinguished from a
reference index which lists all cited documents under each citing document."

4/

See, for example, Tukey, 1962 [611], p. 5: "Any user's greatest need is
likely to be for access to the latest information rather than to the oldest, but
the latest items are children, not ancestors. Genealogy is important, but
progress requires tracing descendants." Iung and Vandeputte, 1960 [291], p. 11,
make a similar distinction between "histoire" (antecedents) and "filiation"
(successors).




A second method, implied in Fano's suggestions for the use of relative frequencies
of association between items found in the literature, is one of citingness, which groups
together items that cite one or more identical references. This method has been
developed by Kessler and his associates as the technique of "bibliographic coupling"
(Kessler, [317] through [323]). The purpose here is to identify groupings of related
items where relatedness is defined in terms of the number of references shared by each
of the members of the group with some given test paper or with each other. It is noted
that where the citedness index and the reference list typically give the bibliographic
references themselves as the searching or retrieval tool, the bibliographic coupling
technique seeks rather to define groups of similar papers. 1/ A third method, and one
which may be combined with either of the other two, is to derive indexing terms for a
given paper from the overlay of indexing terms previously assigned to any papers which
it cites. Salton 2/ further suggests that:

"... Citation indexes could be used to extend a given set of index terms by
starting  with  the  terms  attached  to  a  given  document  or  document  set,  and 
adding  to  them  the  'related'  terms  obtained  from  new  documents  which  cite 
the  original  ones." 

The suggested advantages of citation indexing include the claims that this tool does
not require trained indexers, 3/ that it is highly susceptible to mechanization (Garfield,
1955 [213], 1956 [212], 1957 [211]; Atherton, 1962 [25]; Becker and Hayes, 1963 [45]),
and that it may cost significantly less than subject indexing. 4/ A major advantage
claimed is responsiveness to user, rather than indexer, interests and viewpoints. 5/
Some of the representative claims with respect to this factor are as follows:


1/

See Atherton and Yovich, 1962 [26], p. 3: "Kessler's method, however, does not
retrieve the references cited by a paper. Instead these references are examined
to determine the 'bonds' between papers; e.g., if two papers share six references,
in common, they are said to have a 'coupling strength' of six. By applying either
of two criteria of coupling, one can 'filter out smaller groups of papers' related
to a given paper."

2/

Salton, 1962 [520], p. III-8; see also Lesk, 1963 [356].

3/

Atherton, 1962 [25], p. 3.

4/

See Atherton and Yovich, 1962 [26], pp. 3-4: "Garfield estimates cost of abstracting
and indexing 200,000 articles in one year to be $3 million. He estimates the
cost of a citation index for these same articles (approximately 3 million citations)
to be $300,000." See also Doyle, 1963 [162], p. 8: "The editing labor, the input
preparation cost, and the automatic processing time are all so small that it's very
likely citation indexing is destined for a great surge of popularity in the immediate
future."

5/

Committee on Scientific Information, 1963 [135], pp. 55-56: "Because the indexing
is based on the author's rather than on an indexer's estimate of what articles
are related to what other articles, citation indexes are particularly responsive to
the user's, rather than to the indexer's viewpoint."
"The most feasible scheme for alerting individuals to what is of interest in their
own field requires an on-going up-to-date citation index. For each narrow field
of interest of an individual there are, it is believed with good reason, three to
five to ten key items such that:

(c1) If he knew that a new item referred to one of his key items,
the individual would be glad to skim the new item,

(c2) An individual who skimmed all new items referring to one
of his key items would be adequately alerted to the newest
results in his own specialties." 1/

"A  research  worker  who  finds  one  article  several  years  old  can  relate  later 
developments  by  locating  all  subsequent  articles  that  have  referred  to  it. 
Corrections and errata can be brought together by a citation index." 2/

"Citation  indexing  will  overcome  artificial  dividing  lines  that  are  drawn  in  various 
abstracting services." 3/

"It  is  believed  that  citation  indexes  will  be  useful.  .  .in  bringing  together  related 
materials  in  different  fields  where  the  interrelationships  are  not  readily 
identifiable from other types of indexes." 4/

"Since  the  end  product  of  a  citation  indexing  is  a  listing  which  collects  in  one 
place  the  bibliographical  descendants  of  a  given  cited  author,  bringing  these 
titles  together  helps  to  illuminate  for  the  searcher  the  extent  and  nature  of 
information  association  patterns  employed  by  other  authors  who  had  a  similar 
or  related  interest  to  his  own.    Its  development,  therefore,  serves  as  an 
approach to the user's frame of reference, not the indexer's." 5/
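
The alerting scheme described in the first of these quotations, in which an individual is notified of each new item that refers to one of a small number of personal "key items", might be sketched as follows. This is a minimal illustration under stated assumptions: the function name alerts_for_new_items, the user names, and the item identifiers are all hypothetical and are not taken from the work quoted.

```python
def alerts_for_new_items(new_items, key_items_by_user):
    """Return, for each user, the new items that cite any of his key items.

    new_items:         {new item id: list of items it references}
    key_items_by_user: {user: set of key item ids for that user}
    """
    alerts = {}
    for user, key_items in key_items_by_user.items():
        hits = [
            item
            for item, references in new_items.items()
            if key_items & set(references)  # at least one key item is cited
        ]
        alerts[user] = sorted(hits)
    return alerts


# Hypothetical data: two new items and two users with their key items.
new_items = {
    "N1": ["K1", "X7"],
    "N2": ["X8", "K3"],
}
key_items_by_user = {
    "user_a": {"K1", "K2"},
    "user_b": {"K3"},
}

alerts = alerts_for_new_items(new_items, key_items_by_user)
print(alerts)
```

The sketch makes no attempt to rank the alerted items; condition (c1) in the quotation asks only that the individual be willing to skim each one.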

The  importance  of  being  able  to  pick  up  more  than  the  principal  subject  matter 
clues  is  indeed  an  advantage  of  citation  indexing.    Garfield,  commenting  on  the  potential 
cross-breeding  of  interests,  gives  an  example  of  a  personal  search  for  more  information 
on  the  RCA  electronic  scanning  pencil  in  which  he  was  led  to  one  of  Busa's  reports  on 
machine use in philological analysis and to an article of interest in the field of information
theory. 6/ Garfield further points out that the cross-breeding can extend across


1/

Tukey, 1962 [611], p. 9.

2/

Atherton, 1962 [25], p. 2. See also Garfield, 1955 [213], p. 1.

3/

Atherton and Yovich, 1962 [26], p. 3.

4/

Brownson, 1963 [82], p. 3. See also Garfield, 1957 [211], p. 4.

5/

Becker and Hayes, 1963 [45], p. 137.

6/

Garfield, 1954 [210], pp. 4-5.




changes of terminology with time, 1/ and Lipetz suggests that it can break down barriers
with respect to use of foreign literature. 2/

Other claimed advantages relate to the usefulness of the citation index for purposes
other than those of direct literature search. Such other purposes include identification
of significant research by "equating frequency of citation with relative significance of
subject matter" (Salton, 1962 [520]), determinations of the number of references cited
in a given field or by journal or publication date (Atherton, 1962 [25]), evaluation of the
relative importance of various scientific journals (Westbrook, 1960 [636]; Kessler, 1961
[322]), tracing of trends in the history of ideas or in a particular field of literature
(Brownson, 1963 [82]; Salton, 1962 [520]), 3/ and empirical studies of the frequencies of
self-citation, multiple authorship, and the like (Atherton, 1962 [25]).

A  number  of  disadvantages  of  the  citation  index  are  to  be  noted,  however.    First  is 
the  obvious  lack  of  consistency  between  authors  in  terms  of  whether  or  not  they  cite  the 
prior  literature  at  all  and  in  terms  of  the  completeness  and  correctness  of  the  citations 
they do make. 4/ Atherton quotes Westbrook as saying:

"Science  is  subject  to  changing  fashions  of  interest  that  lead  to  a  distorted 
number  of  published  papers  in  a  given  subject  and  an  inordinately  high  level 
of  citations  to  any  one  who  reports  first  on  the  fashionable  subject.  The 
method will not appraise work performed but not published." 5/

1/

Ibid, p. 6: "Changes in terminology are to a certain extent overcome through the
citation approach, since the author who makes a reference to a paper that is forty
or fifty years old is making the jump in terminology for us." See also Garfield,
1956 [212], p. 11.

2/

Lipetz, 1963 [366], p. 265: "It is reasoned that availability of a citation index
derived from Soviet physics journals and approachable through familiar American
references should stimulate utilization of the Soviet physics journals in the
United States."

3/

See also Reisner, 1963 [497], p. 71: "Citation indexes are receiving increasing
attention as bibliographic aids and as sociometric tools. As sociometric tools,
they are being used to explore the flow of information across national boundaries
and from pure to applied fields, to determine the structure of a field, and to
determine the 'value' of documents or authors."

4/

See, for example, Doyle, 1963 [162], p. 8: "The disadvantages of this kind of
indexing is, of course, that it depends on authors providing ample and suitable
references"; Salton, 1962 [520], p. III-7: "In many cases personal preferences
are evident both as to number and types of papers cited; authors have varying backgrounds,
and there may also exist a tendency toward self-citation regardless of
relevancy"; Thompson, 1963 [600], p. II-1: "The difficulties... are largely due to
the extreme variability of format and to the lack of standardization which prevails
in the publication of citations."

5/

Atherton, 1962 [25], p. 4, citing J. H. Westbrook.
An author not cited frequently enough or not cited within a given time period will
not appear in the citation index. Doyle points out that there are "many kinds of documents
we would like to retrieve where it is not customary to provide citations at all". 1/ In the
bibliographic coupling method, both those papers which make no references to any other
paper and those papers which do not share at least one reference with some other paper
in the system are automatically excluded. 2/

Other  disadvantages  of  the  citation  indexing  technique  relate  to  difficulties  of  the 
lack  of  standard  practices  in  the  citing  of  references  and  to  problems  of  recognizing 
whether  one  citation  is  or  is  not  equivalent  to  another.    These  are,  of  course,  related  to 
the normal difficulties arising from non-standardized formats and practices in descriptive
cataloging,  in  use  of  journal  abbreviations,  in  transliterations  of  foreign  language  titles 
and  names,  and  the  like,  but  they  are  now  aggravated  by  the  present  prospects  for  direct 
machine  processing.    As  Lipetz  points  out: 

"Author's  names  may  be  cited  in  somewhat  different  ways,  and  there  is  no 
simple  mechanical  procedure  for  bringing  together  the  different  versions. 
For  example,  an  author's  name  may  be  cited  both  with  and  without  initials; 
it  would  take  a  comparison  of  the  additional  information  on  the  cited  reference 
to  establish  that  these  authors  are  the  same.    Even  more  difficult  are  the 
problems of mechanically determining that a misspelling has occurred." 3/

Both  the  disadvantages  of  incomplete  and  disproportionate  coverage  and  of  failures 
to  equate  equivalent  citations  are  quite  readily  obvious  to  the  user  of  a  citation  index  if 
he  is  reasonably  familiar  with  the  subject  field  or  document  set  that  is  covered.  Thus, 
the use of the citation index as the exclusive tool for literature search is subject to
defects of both oversight and 'over-cite' which are cumulative and which are often easily
recognizable. Atherton and Yovich emphasize that: "Knowledge of these weaknesses
tends to prevent anyone from trusting the system's ability to retrieve the pertinent
literature." 4/

In  general,  however,  the  citation  index  has  not  been  proposed  as  an  exclusive 
means  for  literature  search  and  retrieval,  but  rather  as  one  of  a  set  of  tools  or  as  a 
supplement to other indexes. 5/ In this connection, it is of interest to note that a manual
technique  of  literature  search  tested  at  The  Thermophysical  Properties  Research  Center 


1/

Doyle, 1963 [162], p. 8.

2/

See Atherton and Yovich, 1962 [26], p. 39; Marthaler, 1963 [399], p. 23.

3/

Lipetz, 1962 [364], p. 262.

4/

Atherton and Yovich, 1962 [26], p. 39.

5/

See, for example, Tukey, 1962 [611], p. 10: "The citation index, in its retrieval
and pursuit uses, is not something to be used alone. Rather, it is the tool whose
presence makes all the other tools more effective."
while not using a citation index as such, makes use of a supplementary citation tracing
technique both to shorten manual search time through abstract journals and to follow up
additional search leads (Lykoudis, et al, 1959 [387]; Cezairliyan, 1962 [107]). The
technique is briefly described as follows:

"One  starts  searching  the  abstracting  journal  beginning  with  the  most  recent 
issue and going back through a number of years, a. Next, the bibliographies
of the papers located in these a years are searched for new references. The
references found in this second step of the search will, in general, cover a
period of years (b - a). Then one reverts back to searching through the abstracting
journal again for another period of a years starting with the year b.
This cyclic procedure of alternate searches through the abstracting journal,
followed by searching the bibliographies of uncovered papers, is repeated until
the total number of desired years of search is covered." 1/

In  a  sample  search  on  the  thermophysical  properties  of  metals,  the  results  showed 
that  the  cost  of  the  cyclic  procedure  was  only  65%  of  the  cost  of  conventional  manual 
search  using  the  abstract  journals  only. 
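
The alternation between abstract-journal search and bibliography search in the cyclic procedure can be laid out as a simple schedule. The sketch below is one reading of the quoted description and not the authors' own formulation: it assumes that years are counted backward from the most recent issue, that each cycle spans b years, of which the first a are searched in the abstracting journal and the remaining (b - a) are covered through bibliographies, and that the final cycle is truncated at the total number of years desired. The function name cyclic_search_schedule is hypothetical.

```python
def cyclic_search_schedule(total_years, a, b):
    """List the (method, from_year, to_year) steps of the cyclic search.

    Years are counted backward from the present (0 = most recent issue).
    Each cycle covers b years: the first a years by abstract-journal search,
    the remaining (b - a) years through bibliographies of the papers found.
    """
    if not 0 < a <= b:
        raise ValueError("require 0 < a <= b")
    schedule = []
    start = 0
    while start < total_years:
        abstract_end = min(start + a, total_years)
        schedule.append(("abstract journal", start, abstract_end))
        bibliography_end = min(start + b, total_years)
        if bibliography_end > abstract_end:
            schedule.append(("bibliographies", abstract_end, bibliography_end))
        start += b
    return schedule


# Illustrative values only: 10 years of coverage, a = 2, b = 5.
schedule = cyclic_search_schedule(10, 2, 5)
for step in schedule:
    print(step)

# Fraction of the period that must be searched in the abstracting journal.
manual_years = sum(end - begin for method, begin, end in schedule
                   if method == "abstract journal")
print(manual_years / 10)
```

Under these assumptions only a/b of the period is searched in the abstracting journal itself. The 65% cost figure reported above reflects the actual test and is not derived from this sketch.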

Recent efforts in the development and use of citation indexes proper include experiments
in evaluation at the American Institute of Physics, 2/ an extensive compilation and
processing program at the Institute for Scientific Information, 3/ and a cooperative program
between the Statistical Techniques Research Group of Princeton University and the
Bell Telephone Laboratories (Tukey, 1962 [611] and [612]). Reisner has reported
work on the compilation of a citation index to 30,000 patent disclosures and its
experimental evaluation in progress at IBM's Thomas J. Watson Research Center (1963
[497]). Goodman is concerned with a citation index to the literature of new educational
media, especially that on programmed learning and teaching machines (1963 [235]).

At the Centre d'Etudes Nucleaires de Saclay, a citation index to papers in the field
of thermonuclear fusion and plasma physics is being prepared. 4/ Lipetz is carrying on
work in the preparation and evaluation of citation indexes, begun at the Itek Corporation,
as an independent worker and consultant to the A.I.P. project. 5/ Carroll and Summit
report that citation indexing is under consideration at Lockheed's Missile and Space
Division (1962 [102]). Kessler and associates at M.I.T. 6/ and Salton's group at

1/

Lykoudis et al, 1959 [387], abstract, p. 351.

2/

Atherton and Yovich, 1962 [26]; National Science Foundation's CR&D Report
No. 11, p. 12.

3/

Ibid, pp. 27-28.

4/

Ibid, p. 76.

5/

Ibid, p. 181.

6/

Ibid, p. 128.
the Harvard Computation Laboratory (Salton, 1961 [512], 1962 [513], 1963 [514] and
[515]),  are  concerned  with  citations  as  a  basis  for  grouping  and  categorizing  sets  of 
related  documents. 

Early examples of citation indexes that have been produced include the precedents
in the fields of statistics and information theory listed by Tukey. 1/ Tukey also refers to
early experimentation involving manually manipulated card files by J. L. Hodges, Jr.,
Charles H. Kraft, and William H. Kruskal. 2/ Goodman (1963 [235]) describes the use of
Termatrex cards showing for each item other items cited by it.

Examples of machine-compiled citation indexes, however, are those of Garfield and
Sher in the field of genetics (1963 [546]), Lipetz's experimental index to the citations in
the proceedings of the two United Nations conferences on the peaceful uses of atomic
energy (1961 [364], 1960 [365]), and the citation index to references listed in the
"Short Papers" submitted for the 1963 Annual Meeting of the American Documentation
Institute (Luhn, 1963 [377]). As of January, 1964, the first five volumes of Science
Citation Index are available from the Institute for Scientific Information. These volumes
are reported to have 2,250,000 lines of copy representing the computer-compiled citation
trails for 102,000 articles published in 1961. 3/

Preliminary  evaluations  of  the  citation  indexing  principle  have,  as  noted  previously, 
been  carried  out  in  an  American  Institute  of  Physics  project  supported  by  the  National 
Science  Foundation.    One  experiment  involved  the  selection  of  a  single  paper  from  the 
December  1,  1961  issue  of  The  Physical  Review  and  the  tracing  of  references  and  citations 
through that journal for the period 1956 to 1960. A bibliography of 64 papers was produced
as a result. This was then evaluated by a nuclear physicist, who found that the
titles  alone  were  an  insufficient  basis  for  judging  whether  or  not  these  papers  should  all 
have  been  included,  and  who  commented  critically  that  there  was  no  way  of  knowing  if  all 
the   papers  really  relevant  to  the  subject  of  the  test  paper  had  indeed  been  found.  A 
further  check  by  search  of  the  subject  index  did  in  fact  reveal  six  pertinent  papers  which 
had  been  missed  by  the  citation  indexing  technique. 

A  second  experiment  at  the  American  Institute  of  Physics  involved  application  of 
Kessler's  "coupling  strength"  criteria  to  41  of  the  64  papers  selected  in  the  first 
experiment,  the  remainder  being  excluded  because  they  shared  no  references  with  any 
other  paper.    The  resultant  groupings  of  presumably  highly  related  papers  were  also 
evaluated  by  a  subject  matter  specialist,  who  found  them  relevant  to  each  other  but  the 
selection incomplete. Atherton and Yovich, reporting these A.I.P. experiments, concluded
that: "More work will have to be done before the usefulness of citation indexing
can be accurately determined." 4/


1/

Tukey, 1962 [611], pp. 23-24.

2/

Ibid, p. 24.

3/

See news note, Special Libraries, Jan. 1964, p. 58.

4/

Atherton and Yovich, 1962 [26], p. 22.
Kessler himself and his associates have also conducted some experiments in
comparative    evaluation  of  indexing  aids  derived  from  citation  data  on  the  one  hand  and 
from  conventional  subject  indexing  on  the  other.    The  basis  for  evaluation  was  a  total  of 
334  papers  published  in  The  Physical  Review  in  1958.    The  study  involved  detailed 
comparison of the ways in which these papers fell into related groups according to the
"analytic subject index" used by the journal's editors and according to the method of
"bibliographic coupling". The essentials of the latter method are described as follows:

"a. A single item of reference used by two papers is called one unit of coupling
between them.

"b. A number of papers constitute a related group G_A, if each member of the
group has at least one coupling unit to a given test paper P_0.

"c. The coupling strength between P_0 and any member of G_A is measured by
the number of coupling units (n) between them." 1/
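
The three statements above translate directly into set operations on reference lists. The following Python sketch is offered only as an illustration of the definitions as quoted; it is not Kessler's program, and the function names coupling_strength and related_group, as well as the paper and reference identifiers, are assumptions made for the example.

```python
def coupling_strength(references_a, references_b):
    """Number of coupling units: references used by both papers (rule a)."""
    return len(set(references_a) & set(references_b))


def related_group(test_paper, reference_lists, min_strength=1):
    """Papers sharing at least min_strength references with the test paper.

    Returns {paper: coupling strength n}, following rules b and c; a paper
    with no shared reference is excluded from the group.
    """
    test_references = reference_lists[test_paper]
    group = {}
    for paper, references in reference_lists.items():
        if paper == test_paper:
            continue
        n = coupling_strength(test_references, references)
        if n >= min_strength:
            group[paper] = n
    return group


# Hypothetical reference lists; P4 cites nothing at all.
reference_lists = {
    "P0": ["R1", "R2", "R3"],
    "P1": ["R2", "R3", "R9"],
    "P2": ["R3"],
    "P3": ["R7"],
    "P4": [],
}

group = related_group("P0", reference_lists)
print(group)
```

Raising min_strength gives a stricter criterion of relatedness. Papers such as P3 and P4 in the example, which share no reference with the test paper, fall outside the group, as noted earlier for the method in general.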

For  the  334  papers,   73  categories  of  the  Analytic  Subject  Index  (ASI)  had  been  used. 
For  the  bibliographic  coupling  method,  each  of  the  papers  was  in  turn  considered  as  the 
test  paper  and  groups  were  formed  for  any  of  the  333  other  papers  that  shared  one  or 
more  citations  with  it.    In  general,  it  was  concluded  that  there  was  good  correlation 
between the groupings of papers achieved by the two methods. It should be noted, however,
that 44 papers fell into no groups at all on the basis of the bibliographic coupling
criterion. 2/

Salton  and  associates  at  the  Harvard  Computation  Laboratory  are  also  concerned 
with  the  citation  indexing  principle  as  a  possible  basis  for  grouping  similar  documents. 
They  are  also  concerned  with  evaluation  of  results  so  obtained  by  comparison  with 
document  groups  obtained  by  subject  indexing  means.    In  the  comparative  experiments, 
data  were  first  compiled  for  a  closed  document  set  of  62  items  as  to  similarities  with 
respect to both "citedness" and "citingness". The same items were manually indexed
and  similarity  coefficients  between  these  items  were  derived  from  overlappings  of 
assigned  index  terms.    When  the  two  measures  of  similarity  were  compared  with  each 
other  and  with  document  associations  obtained  by  random  assignments  of  "citations"  and 
"terms",  the  conclusions  reached  were  as  follows: 

"The  similarity  coefficients  obtained  by  comparing  overlapping  citations  for  a 
sample  document  collection  with  overlapping,  manually  generated  index  terms 
are  much  larger  than  those  obtained  by  assuming  a  random  assignment  of 
citations  and  terms  to  the  documents;  relatively  large  similarity  coefficients 
are  generated  for  nearly  all  documents  which  exhibit  at  least  a  minimum 
number  of  citations;  little  seems  to  be  gained  by  using  citation  links  of  length 
greater  than  two;  for  early  documents,   citedness  furnishes  a  better  indication 
than  the  amount  of  citing,   and  vice  versa  for  recent  documents;  for  documents 
which  can  both  cite  and  be  cited,   equally  good  indications  seem  to  be  obtained 
by comparing citing and cited documents." 3/


1/

Kessler, 1963 [320], p. 1, footnote.

2/

Ibid, p. 5.

3/

Salton, 1962 [520], p. III-42.
In the Salton project, tests of the value of citation links for the assignment of index
terms have been made by comparing the citation pattern of an "unknown" document with
those of other documents in the collection to derive a set of five "related" documents,
where relatedness is decided on the basis of the magnitude of the similarity coefficients
for the citation links. Any index term that appears at least twice in the set of terms
previously assigned to the five related documents is then assigned to the new item. In
general, approximately 50% of the terms so assigned were also assigned to the same
"new" items by human indexing procedures. 1/

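The assignment procedure just described (find the five most closely related documents by citation pattern, then assign any term occurring at least twice among their previously assigned terms) can be sketched as follows. The similarity coefficient used in the Salton experiments is not specified in this report; the sketch substitutes a simple set-overlap (Jaccard) coefficient on reference lists purely as an assumption. The function names, document identifiers, and index terms are likewise hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Set-overlap coefficient, used here as a stand-in similarity measure."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_terms(new_references, collection, k=5, min_votes=2):
    """Assign terms to a new document from its k most similar neighbors.

    collection: {doc id: {"refs": [...], "terms": [...]}}
    A term is assigned if it was given to at least min_votes of the related
    documents; documents with zero similarity are never counted as related.
    """
    scored = [(jaccard(new_references, doc["refs"]), doc_id)
              for doc_id, doc in collection.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    related = [doc_id for score, doc_id in scored if score > 0][:k]
    votes = Counter(term
                    for doc_id in related
                    for term in set(collection[doc_id]["terms"]))
    return sorted(term for term, n in votes.items() if n >= min_votes)


# Hypothetical collection of previously indexed documents.
collection = {
    "D1": {"refs": ["R1", "R2"], "terms": ["indexing", "citation"]},
    "D2": {"refs": ["R2", "R3"], "terms": ["citation", "retrieval"]},
    "D3": {"refs": ["R8"], "terms": ["physics"]},
}

assigned = assign_terms(["R2", "R3", "R4"], collection)
print(assigned)
```

The threshold of two occurrences and the limit of five related documents follow the description above; any other coefficient could be substituted for jaccard without changing the rest of the procedure.
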
As  we  have  previously  noted,  however,  the  advantages  of  citation  indexing  are  likely 
to  be  most  effectively  applied  when  used  as  part  of  an  array  of  other  tools.  Tukey 
suggests,  in  particular,  that  permutation  indexes  of  titles,  as  in  KWIC  systems,  would  be 
of great value as "starter" and "re-check" mechanisms for the use of citation indexes. 2/
Brownson  reports: 

"Consideration  is  now  being  given  to  the  possibility  of  experimenting  with  a 
'hybrid'  type  of  index  that  would  combine  permuted  titles,  authors,  and  citation 
data.    Such  an  index  might  be  more  useful  than  any  of  the  individual  types  of 
indexes  issued  singly;  and,   since  no  human  indexing  judgment  would  be  involved, 
it could be prepared largely by machine and issued rapidly." 3/

Williams, while at ITEK, proposed a hybrid integrated index combining listings by
authors, corporate authors or author affiliations, keywords-in-context from title, and
references to works cited by and to works citing an item, and she also developed a sample
format for selected items from several journals in the field of philosophy. 4/

Precisely such a hybrid tool was provided with the Short Papers for the A.D.I.
Annual Meeting 1963, and it was indeed issued rapidly. A brief period of only two or
three weeks elapsed between receipt of many of the manuscripts and the distribution of
two automatically typeset volumes. The second of these volumes contains a KWIC and
an author index to these papers themselves, a bibliography and citation index to all
papers referenced by them, and KWIC and author indexes to the cited papers, all
computer-compiled within this time period. 5/


1/

Ibid. See also Lesk, 1963 [357], p. V-8.

2/

Tukey, 1962 [611], p. 12.

3/

Brownson, 1963 [82], p. 4.

4/

T. M. Williams, private communication, dated January 4, 1962.

5/

Luhn, 1963 [376] and [377], pp. 353-382.
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2.  5     Machine  Conversion  From  One  Index  Set  to  Another 


A  final  possibility  in  the  general  area  of  machine  compilation  of  indexes  and  machine 
use  to  improve  the  availability  of  indexes  is  as  yet  in  a  highly  speculative  stage.    This  is 
the  possibility  of  converting  from  one  index  set  to  another  by  machine  look-up  procedures. 
In the Welch Medical Library project, mentioned earlier, use was made of punched card
techniques to convert from one index arrangement to another, 1/ but machine-recognizable
identifiers for both arrangements were explicitly encoded in the material.
In recent studies at Datatrol, however, preliminary investigations have been conducted
looking toward machine look-up of index-term equivalence tables in order to convert, for
example, DDC descriptors to corresponding subject headings used in the AEC vocabulary.

Hammond and Rosenborg (1962 [250] and [252]) report on the compilation of a
unilateral table of "indexing equivalents" between approximately 7,000 DDC descriptors and
those AEC subject headings judged by them to be identical, synonymous, or "usefully"
equivalent, such as one or the other being subsumed by a broader or more generic term.
Findings showed 23.8% of the terms of the DDC vocabulary presumably identical to those
of AEC, 38.1% of lower generic level, 7.4% of higher generic level, and 10.9% for which
no useful equivalents could be found. A sample table of indexing equivalents was prepared
for DDC-to-AEC conversion, but not in the opposite direction.
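
A one-directional table of indexing equivalents of this kind can be pictured as a simple look-up structure. The following is only a minimal sketch under stated assumptions: the table entries are placeholders (no actual DDC or AEC terms are reproduced here), and the names `Equivalent`, `DDC_TO_AEC`, and `convert` are hypothetical, not taken from the Datatrol studies.

```python
from typing import NamedTuple


class Equivalent(NamedTuple):
    aec_heading: str
    relation: str  # "identical", "synonymous", or "generic" (broader/narrower)


# Placeholder entries for illustration only; not real vocabulary terms.
DDC_TO_AEC = {
    "DESCRIPTOR A": Equivalent("HEADING A", "identical"),
    "DESCRIPTOR B": Equivalent("HEADING B", "synonymous"),
    "DESCRIPTOR C": Equivalent("HEADING X", "generic"),
}


def convert(descriptors):
    """Look up each DDC descriptor; return (AEC headings, unmatched descriptors)."""
    converted, unmatched = [], []
    for descriptor in descriptors:
        equivalent = DDC_TO_AEC.get(descriptor.upper())
        if equivalent is None:
            # No useful equivalent: some other form of retrieval is needed.
            unmatched.append(descriptor)
        elif equivalent.aec_heading not in converted:
            converted.append(equivalent.aec_heading)
    return converted, unmatched


if __name__ == "__main__":
    print(convert(["Descriptor A", "descriptor c", "Descriptor Z"]))
```

Because several descriptors may map to one broader heading, such a table cannot simply be inverted to convert in the opposite direction, which is consistent with the sample table having been prepared for DDC-to-AEC conversion only.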

Since,  in  general,   convertibility  of  indexing  vocabularies  would  be  desirable 
wherever  duplication  of  cataloging  and  indexing  effort  is  likely  to  occur  (that  is,  where 
two  or  more  different  documentation  organizations  receive  at  least  some  of  the  same 
material  as  inputs  to  their  systems),  the  results  of  these  preliminary  studies  are  pro- 
vocative and  appear  to  merit  the  further  study  that  is  being  sponsored  by  an  Interagency 
Task  Group  on  Vocabulary  Study  of  the  Committee  on  Scientific  Information,  under  the 
Federal  Council  for  Science  and  Technology. 

There  are  many  substantial  difficulties,  however.    When  applied  to  actual  indexing 
of  the  same  items  by  the  two  agencies,  it  was  found  that  for  277  items  indexed  by  both 
AEC and DDC (then ASTIA):

"ASTIA used a total of 2,571 descriptors, and AEC 840 subject headings ... of
these, 392, or roughly half of the AEC terms, were either completely or, for
all practical purpose, identical." 2/
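
The proportion behind "roughly half" can be checked directly from the counts just quoted. The calculation below is only an illustrative check. The second ratio, which sets the same 392 matches against the ASTIA total, is an added comparison that assumes a one-to-one pairing of matching terms and is not stated in the source.

```python
# Counts quoted for the 277 items indexed by both agencies.
astia_descriptors = 2571
aec_headings = 840
matching_terms = 392

# Share of AEC subject headings that were identical, or practically so.
share_of_aec = matching_terms / aec_headings

# Assumption for illustration: the same matches measured against ASTIA's total.
share_of_astia = matching_terms / astia_descriptors

print(f"{share_of_aec:.1%} of the AEC subject headings")
print(f"{share_of_astia:.1%} of the ASTIA descriptors")
```

On these counts the matching terms amount to about 47 percent of the AEC headings but, under the stated assumption, only about 15 percent of the ASTIA descriptors, since ASTIA assigned roughly three times as many terms to the same items.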

1/ Garfield, 1959 [221], p. 471.
2/ Hammond, 1962 [250], p. 4.

Painter (1963 [460]) made further studies of equivalency in her investigations of
duplication and consistency of subject indexing at several Government agencies. For 200
items indexed by both AEC and DDC, she found 20% DDC equivalency, 67% AEC
equivalency, and 30% similarity of actual indexing. She concludes, in part:

"In considering these solutions and the statistics revealed by the studies it should
be concluded that with a maximum of only 69 percent equivalency, or convertibility,
and a minimum of 28 percent, there is still a large proportion of terms which will
necessitate some other form of retrieval. This is the proportion which is involved
with the problem of generics, where a term in one system subsumes two of another
and vice-versa. An additional problem evolves in attempting to reconcile two
different subject concepts, one, the subject heading which usually has a single
access point and one, the Uniterm or descriptor which has multiple access through
coordination. Thus the practicality of a system made up of many units supplying
information indexed differently, using as a basis for retrieval a table of equivalents,
is questionable." 1/

Moreover, the results of tests of inter-indexer consistency rates within the same
agency were not encouraging. Thus Painter further concludes:

"The study, in combining the results of the equivalency analysis and the consistency
of indexing within each system and an equivalency of only 30 percent within the
broadest system, a table of equivalents is at present of little value in either a
manual or a machine system. In order to apply a table of equivalents efficiently,
both a high degree of consistency and a high degree of equivalency is essential." 2/

She therefore stresses that the possibilities for conversion by machine techniques
from one indexing set to an equivalent set for another vocabulary are adversely affected
by the generally poor rates of inter-indexer consistency. With reference both to the
Datatrol studies 3/ and to corroborative findings of her own, she states:

"The value of equivalency studies and most particularly the table of equivalents
presuppose the consistency of indexing. Convertibility between systems is thus
dependent on the consistency of indexing. Without consistency, the vocabularies
as units are not sound; equivalencies cannot be drawn or effectively used for
convertibility." 4/

1/ Painter, 1963 [460], p. 104.
2/ Ibid., p. ix.
3/ Hammond, 1962 [250]; Hammond and Rosenborg, 1962 [252].
4/ Painter, 1963 [460], p. 109. Note that these estimates of inter-indexer
consistency may be quite optimistic, as discussed on pp. 157-160 of this report.

3.  INDEXES GENERATED BY MACHINE -- AUTOMATIC DERIVATIVE INDEXING


We  have  noted,  in  the  earlier  statement  of  the  scope  of  this  survey,  a  distinction 
between  "derivative"  and  "assignment"  indexing.     This  distinction  is  related  directly  to 
the  question:    "Is  what  can  be  done  by  machine  properly  termed  'abstracting',  'indexing', 
or  'classifying'?"    It  relates  also,  as  we  have  remarked,  to  a  continuing  controversy  far 
older than any question of the introduction of machine techniques--that between "word"
and "concept" indexing, between "uniterms" if selected directly from the text and
"descriptors" in the sense of their being indexing terms selected so as to have "a
carefully specified meaning for retrieval", 1/ to say nothing of contrasts with subject heading
schemes  and  classification  schedules. 

Some  of  the  major  arguments  pro  and  con  derivative  (usually  word)  and  assignment 
(usually  concept)  indexing  will  be  considered  in  a  subsequent  section  of  this  report  on  the 
problems  of  evaluating  indexing  methods.    Nevertheless,  the  present  popularity  of 
automatic  derivative  indexes  of  the  KWIC  type,  while  subject  to  all  the  disadvantages 
typically  cited  for  all  purely  derivative  indexing  systems,  does  show  the  actuality  of 
automatic  indexing  potentialities  and  may  in  fact  hold  the  promise  of  solving  some  of  the 
present-day  problems  of  subject  control. 

In  this  section,  we  shall  consider  first  the  straightforward  word  extraction  tech- 
niques used  in  KWIC  type  indexes.    Possibilities  for  modified  derivative  indexing  by 
title  augmentation,  manipulation  of  word  groups  and  use  of  special  clues  in  keyword 
selection are then discussed, including work by Baxendale, Luhn, and Artandi. Related
research and development efforts in automatic abstracting which lend themselves
to derivation of indexing terms include proposals and experiments by Luhn, Oswald,
Edmundson, Wyllys, Doyle, and Lesk and Storm, among others. Some comments will
be given on the quality of modified derivative indexing by machine. Automatic derivative
indexing at the time of search, as in the natural language text searching systems of
Swanson, Maron, Kuhns, and Ray, and Eldridge and Dennis, will be discussed in a later
section of this report. 2/

3.1  KWIC Indexes

The development of computer-generated permuted-title keyword indexes, especially
in the issuances of Chemical Titles and B.A.S.I.C. (Biological Abstracts-Subjects-In-Context)
has been hailed by some as "the miracle of the decade" and "the greatest thing
to happen in chemistry since the invention of the test tube". 3/ The major reason for
the optimistic enthusiasm is the speed with which the computer can produce
a complete index to some specific set of books, documents or papers so that publication
and dissemination of the index can be prompt and thus serve as an important tool in
maintenance of truly current awareness.

1/ Mooers, 1963 [423], p. 3.
2/ See pp. 132-136.
3/ Quoted by D. R. Baker, statement in "U. S. Congress, Senate Committee on
Government Operations", 1960 [619], p. 169.

For example, Herner in his 1961 review of the state-of-the-art of organizing
information says:

"I am told that the American Chemical Society has never had a more successful
basic science publication. The key to the whole thing is, I believe, the extreme
currency of Chemical Titles. This in turn derives from the speed and simplicity
of the KWIC process." 1/

Conrad reports as follows:

"Reception of B.A.S.I.C. ... has been so extremely enthusiastic ... that we
are excited by the possibilities of producing permuted title indexes in one or
more additional languages. The creation of a B.A.S.I.C. index in any language
requires only that the titles be translated and punched on cards. Alphabetical
arrangement, permutation and 'type-setting' is completely automated and, for
5,000 titles takes only two hours to accomplish."

3.1.1  Applications of KWIC Indexing Techniques

The KWIC type process is indeed simple and straightforward. The words of the
author's title are prepared for input to the computer by keystroking, either to punched
cards or to punched paper tape. After being read by the computer, the text of a title is
normally processed against a "stop list" to eliminate from further processing the more
common words, such as "the", "and", prepositions, and the like, and words so general
as to be insignificant for indexing purposes, such as "demonstration", "typical",
"measurements", "steps", and the like. The remaining presumably "significant" or
"key" words are then, in effect, taken one at a time to an indexing position or window,
where they are sorted in alphabetical order. The result is a listing of each such word
together with its surrounding context, out to the limit of the line or lines permitted in a
given format. As each keyword is processed, the title itself is moved over so that the
next keyword occupies the indexing position, and this process is repeated until the entire
title has thus been cyclically permuted.
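
The steps just described (stop-list filtering, cyclic permutation, alphabetical sorting) can be sketched in a few lines of modern code. This is only an illustration, not a reconstruction of any of the programs surveyed: the stop list is a short hypothetical sample, and the function name `kwic_entries` and the use of "/" to mark the original end of the title are assumptions made for the example.

```python
import re

# Hypothetical sample stop list; an operational list would be much longer.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "in", "on", "for", "to", "by", "with",
    "demonstration", "typical", "measurements", "steps",
}


def kwic_entries(titles, stop_words=STOP_WORDS):
    """Return (keyword, permuted title, title number) tuples sorted by keyword."""
    entries = []
    for number, title in enumerate(titles, start=1):
        words = re.findall(r"[A-Za-z0-9'-]+", title)
        for i, word in enumerate(words):
            if word.lower() in stop_words:
                continue  # common or overly general words get no entry
            # Cyclic permutation: the keyword leads, and the words that
            # preceded it follow a "/" marking the original end of the title.
            rotated = words[i:] + (["/"] + words[:i] if i else [])
            entries.append((word.lower(), " ".join(rotated), number))
    entries.sort()  # alphabetical order by keyword
    return entries


if __name__ == "__main__":
    sample = ["The Design of Logical Machines", "Measurements of Machine Indexing"]
    for keyword, context, number in kwic_entries(sample):
        print(f"{context:<45}{number:>4}")
```

Each title yields one entry per significant word, so the size of the index grows with the number of keywords per title rather than with the number of titles alone.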

A  number  of  formats  are  available  in  which  the  length  of  the  line,  the  position  of 
the  indexing  window,  and  the  extent  of  "wrap-around"  (bringing  the  end  of  a  title  in  at  the 
beginning  of  a  line  to  fill  space  that  would  otherwise  be  left  blank)  are  major  variables. 
Current  examples  of  KWIC  type  indexing  output  are  shown  in  Figures  2  through  7. 
Usually,  the  indexing  window  is  located  at  or  near  the  center  of  the  line  with  several 
extra  spaces  to  the  immediate  left  or  with  other  devices  such  as  the  shading  of 
B.  A.  S.  I.  C .  to  aid  the  searcher  in  scanning  down  the  keywords  listed.    This  is 
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Conrad,  1962  [137],  p.  378A. 
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158  TENSION    AND    THE    RATE    OF   NET    ENTRY    OF    CALCIUM-45    IN    ISOLATED    PERFUSED    RABBIT  VENTRICLES 

20  of      hexobarbltal    by   microsomal   enzymes   of  liver 

21  of      hexobarbltal    by   mtcrosonal    enzymes   of  liver 

150  hydro    ergokryptlne    with      dl    hydro    ergocornlne    and        dl    hydro    er goc r I s 1 1 neL h y de r g I ne  ]  Inhibit 

150  hydro   ergocornlne    and        dl    hydro    er goc r I s 1 1 n e[ hy de r g I ne  ]    Inhibit    action   of        vasopressin  bu 

ISO  tolazollne    and      dl    hydro    ergokryptlne    with      dl    hydro    ergocornlne    and        dt    hydro  erg 

8  but*    action   antagonized    by      ergotamlne    at    higher  dosage 

8  of   rabbitt    action    Increased   by      ergotamlne    at    low   dosage    butt    action    antagonized    by  ergot 

5  dog   and   cati    action   reversed   by        ergotamlne      ergotoxin        guanethldlne    and      phenylephrine  bu 

5  action   reversed   by        ergotamlne      ergotoxin        guanethldlne    and      phenylephrine   but    In    rat.  ac 

119  Streptococcus   pyogenes*    Erslpelothrix    Insldlosa    and   Streptococcus   agalactae    In  wit 

42  in    human    accompanied    by    erythema   [mild]   given   by    Injection    Into   skin  lesions 

46  pyrrolldone    causes    aggregation    of    erythrocytes    In    blood   of    hamster    given    t n t ra— ar t er I  a I ly 

46  A    SYNTHETIC    MACROMOLECULE    ON    THE   ERYTHROCYTES    OF    THE  BLOOD 

162  chloramphenicol    and      erythromycin    strongly    Inhibit    endotrophlc    sporulatlon    of  B 

166  ERYTHROMYCIN-    AND    STREPTOMYCIN-LIKE    ANTIBIOTICS    AS  BLEACHI 

32  acid    In    acid— soluble    fraction    of   Escherichia  coll 

36  amino    uridine    Inhibits   growth    of   Escherichia    coil    and  Neurospora 

37  cytldlne    Inhibit   growth    of   Escherichia    coll    but    do   not    Inhibit    growth   of  Neurospora 
35  amino    uridine    Inhibits   growth    of   Escherichia   coll    K— 12t    action    reversed   by      glutathione  L— 

32  fluoro  uracil  has  toxic  action  on  Escherichia  coll  while  organism  Is  growing  actively*  actio 
34  amino    uridine    Inhibits   growth    of   Escherichia    colt*    action    reversed   by        glutathione    •*  acti 

33  deoxy    uridine    Inhibit    growth   of   Escherichia    coll*    action    reversed   by      uridine  cytldlne 
119  growth    of   Staphylococcus   aureus*    Escherichia    coll*    Salmonella    typhosa*    Pasturella  reultoclda 

32  pimellc   acid    In   cell    walls    of    Escherichia   colli    Increases   content    of   N— acetyl    hcxos  amin 

32  content    of   N— acetyl    hexos   amine    esters   and   dtamino   pimellc   acid    In   acid— soluble    fraction  o 

157  estll   general    anesthetic    decreases   urinary   output   and  Incr 

157  AFTER    GENERAL    ANESTHESIA    WITH  ESTIL. 

64  cyclo    pentyl    propionate    and      estra    dl    ol    valerate    hormone   cause   edema   and   thickening  of 

53  THE   TERATOGENIC    ACTION   OF    ESTRADIOL    AND    THYROXINE    ON    MUELLER'S    DUCT    IN    THE    CHICKEN  E 

63  Increases    excretion    of      estrone      estradiol      estrlol    and    total    neutral    17—   keto   steroids  In 

53  estradiol    Inhibits    formation    of   Mueller's    duct    of  chicken 

55  estradiol    with      thyroxine    strongly    Inhibit    formation    of  st 

63  excretion    of      estrone      estradiol      estrlol    and   total    neutral    17—   keto   steroids    In   urine   of  yo 

63  URINE    ESTROGEN    RESPONSES    TO   HUMAN    CHORIONIC    GONADOTROPIN    IN  YOUN 

136  methoxyest  ra— 1*3*  5—  tri    ene   has    estrogenic  action 

135  progestational    action    on    Immature    estrogen— primed    rabbit    and   does    not    Inhibit   growth    of  adre 

131  progestational    action    on    Immature   estroge  n—  primed    rabbit    given  orally 

132  progestational    action    on    Immature    estrogen— primed    rabbit    given  orally 

133  progestational    action    on    Immature   estrogen— primed   rabbit    given  orally 

143  action    on    Immature    rabbit   [estr  oge  n— p  r Imed]   given  subcutaneously 

63  Increases    excretion    of      estrone      estradiol      estrlol    and   total    neutral    1 7—   keto    s  t  e 

5  p he ny I  )— 2— (  1  so    propyl    amino)    ethanol    have    hypotensive    action    on    barbiturate  narcotized 

144  2— dl    methyl    amino    ethanol    Increases    Incorporation    of   phosphorus    Into  phospha 

144  THE   EFFECTS   OF    2-DI    METHYL    AMINO    ETHANOL    ON    BRAIN   PHOSPHO   LIPID  METABOLISM 

148  sulfate      choline    phenyl    ether    bromide      DMPP    and      hist   amine    acid   phosphate    In  IsoL 

133  sterone*    less   effective    than      ethinyl    testo    sterone   moderately    have   progestational  actio 

131  more    effective    than      ethinyl    testo    sterone    strongly   has   progestational    action  o 

132  sterone*    eq.ua  I    In    action    to      ethinyl    testo    sterone   strongly   have    progestational  action 

43  ANTAGONISM    OF    LYSERGIC    ACID   01    ETHYL    AMIDE   BY    CHLORPROMAZ I NE    AND   PHEN    OXY   BENZ  AMINE. 

44  recognition    of      Lysergic    acid   dl    ethyl    amide    In    human    If    given  simultaneously 

43  recognition    of      lysergic    acid   dl    ethyl    amide    In    human    only    If   given   previous    to  latter 

13  catechol  amines  caused  by  phen  ethyl  amine 
16  catechol    amines    caused   by      phen    ethyl  amine 

14  catechol  amines  caused  by  phen  ethyl  Amine  and  does  not  Inhibit  secretion  of  catechol  ami 
12  catechol    amines   caused   by      phen    ethyl    amine      nicotine    and  carbachol 

80  more   effective    than      or- 3— (2— dl    ethyl    amino    ethyl)    amino    troplne    bis   meth    Iodide    has  nicot 

80  B— 3— (2— dl    ethyl    amino    ethyl)    amino    troplne    bis   meth    Iodide*    more  eff 

113  substituted   benzyl    and   phen    ethyl    hydr   azines   have    toxic    action   [LDso    292«4000-t-*  400t» 

145  trI  ethyi  tin  and  trI  ethyl  lead  very  strongly  Inhibit  metabolism  of  glucose  by 
145  THE    ACTION   OF   TRI    ETHYL   TIN.    TRI    ETHYL   LEAD.    ETHYL    MERCURY    AND    OTHER    INHIBITORS    ON   THE  META 

145  OF    TRI    ETHYL    TIN.    TRI    ETHYL   LEAD*    ETHYL    MERCURY   AND    OTHER    INHIBITORS    ON   THE    METABOLISM    OF  BR 

146  ethylmercurychlorlde  chlorpromazlne  malonlcacld[ass 
145  trI  ethyl  tin  and  tri  ethyl  lead  very  strongly  Inhibit  metabo 
145  THE    ACTION    OF    TRI    ETHYL   TIN,    TRI    ETHYL   LEAD.    ETHYL    MERCURY    AND   OTHER  INHIBIT 

80  than      a— 3— ( 2— d 1    ethyl    amino    ethyl)   amino    troplne    bis   meth    Iodide    has   nicotinic  blockin 

80  6— 3— (2— dl    ethyl    amino   ethyl)    amino    troplne    bis    meth    Iodide*    more   effective  than 

140  given    as    salts    with   dl    benzyl    ethylene  diamine 

139  01—9—   ethyl- 2'—   h  y  dr  o*y— 2  •  5— d  I    methyl— 6*7-    benzo   morphan   and  fl— 

139  morphan   and      6—   5 *  9— d 1    methyl— 2—   ethyl— 2'—   h y d r o x y— 6 • 7— be n z o   morphan   weakly   have    toxic  act! 

83  l— (2— p—    amino   phenyl)   e t hy I— 2—   methyl— 3—    phenyl  — 3-    proplon   oxy   pyrrolidine  has 

Figure  5.    Sample  Page,  Cliemlcal  Biological  Activities 
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NOZZLE 
NOZZLE 
NOZZLE 
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NUCLEAR 


ABSORPTION  OF  D-GLUCOSE  BY  SEGMENTS  OF  INTESTl 
NE  FROM  ACTIVE   AND  HIBERNATING,    IRRADIATED  AND 

NON-IRRADIATED  GROUND  SQUIRRELS,    CITELLUS   T« I  NUCLEAR 
DECEHLINEATUS  NASA  N63-11002IK)      $2.50  0726 

CORRELATIONS   IN  A   NON- I SOTHERMAL   PLASMA  NUCLEAR 

AO-290  053(K)      »1.10  0196 
INVESTIGATION  OF   MICROWAVE  NON-LINEAR   EFFECTS  NUCLEAR 
UTILIZING  FERROMAGNETIC  MATERIALS 

AO-290  572(K)  J2.60  0<.87 
BIBLIOGRAPHY  AND  TABULATION  OF  DAMPING  PROPERT 
lES  OF   NON-METALLIC  MATERIALS  NUCLEAR 

AD-289  856(K)  t3.00  0502 
NOTES  ON  NON-MILITARY  MEASURES  IN  CONTROL  OF  I 
NSURGENCY  AD-290  237(K)      tl.60  0696  NUCLEAR 

JUDGMENTS  OF   VISUAL   VELOCITY   AS   A   FUNCTION  OF 
THE   LENGTH  OF   OBSERVATION  TIME   OF   MOVING  OR  NO 
N-MOVING   STIMULI  PB   162   5',9(K)      11.60  0125  NUCLEAR 

TABLES  OF   NON-REL A T I  V  I  ST  I C   ELECTRON  TRAJECTORI 
ES  FOR  FIELD   EMISSION  CATHODES  NUCLEAR 

AO-290  696(K)    SI*.. 50  0239 
NON-SIMILAR   NUMERICAL   METHODS  OF    SOLUTION  FOR 
ELECTRODE   BOUNDARY  LAYERS    IN  A  CROSSED  FIELD  A  NUCLEAR 
CCELERATOR  AO-290   525(K)      15.60  0185 

NONDESTRUCTIVE   SYSTEM  FOR    INSPECTION  OF  FIBER  NUCLEAR 
GLASS-REINFORCED   PLASTIC   MISSILE  CASES 

AO-289  825{K)      H.60  0632 
X-RAY   IMAGE   SYSTEM  FOR  NONDESTRUCTIVE   TESTING  NUCLEAR 
OF   SOLID   PROPELLANT   MISSILE   CASE   WALLS   AND  WEL 
DMENTS  AD-289  821(K)      t3.60  0637 

MAGNETOHYORODYNAHIC   STABILITY   OF   VORTEX   FLOW  -  NUCLEAR 
A  NONDISSIPATIVE,    INCOMPRESSIBLE  ANALYSIS 

ORNL-TM-402 ( K )      13.60  0615  NUCLEAR 
SCALE   EFFECTS   FOR   NONEOU I L I  BR  I UM   CONVECTIVE  HE 
AT  TRANSFER   WITH   SIMULTANEOUS   GAS   PHASE   AND  SU 
RFACE   CHEMICAL   REACTIONS.    APPLICATION  TO  HYPER 
SONIC   FLIGHT   AT  HIGH   ALTITUDES  NUCLEAR 

AD-291   032(K)      tl.60  0025 
APPLICATION  OF  VARIATIONAL   EQUATION  OF  MOTION  NUCLEAR 
TO  THE   NONLINEAR   VIBRATION  ANALYSIS  OF  HOMOGEN 
EOUS  AND   LAYERED   PLATES   AND  SHELLS  NUCLEAR 

AO-289  868(K)  12.60  0667 
EXTENSIONS   IN   THE    SYNTHESIS  OF   TIME  OPTIMAL  OR 

BANG-BANG  NONLINEAR  CONTROL   SYSTEMS.    PART    I.  NUCLEAR 
THE   SYNTHESIS   OF   QUAS I -STAT  I ONARY   OPTIMUM  NONL 

INEAR  CONTROL   SYSTEMS  NULL-ZONE 

PB  162  547(K)  l'..60  0235 
EXTENSIONS   IN   THE   SYNTHESIS  OF   TIME   OPTIMAL   OR  NUMBERS 

BANG-BANG  NONLINEAR   CONTROL   SYSTEMS.    PART  I. 
THE   SYNTHESIS  OF   QUAS I -STAT  1 ONARY  OPTIMUM  NONL 
INEAR  CONTROL  SYSTEMS 

PB   162   547(K)      1',.60  0235  NUMBERS 
NONLINEAR   FLEXURAL  VIBRATIONS  OF    SANDWICH  PLAT 
ES  AD-289  871(K)      12.60  0669 

OPTIMUM  NONLINEAR   CONTROL   FOR   ARBITRARY  DISTUR 

BANCES  NASA   N62-15890(K)      12.60  0682  NUMERICAL 

A  TECHNIQUE  FOR  NARROW-BAND  TELEMETRY  OF  NONRE 
CURRENT   PULSES  AD-290  697(K)      12.60  0577 

ELECTROMAGNETIC   SCATTERING  FROM   A   SPHERICAL   NO  NUMERICAL 
NUNIFORM  MEDIUM.    PART    II.    THE   RADAR  CROSS  SECT 
ION  OF  A  FLARE  AD-289  615(K)      12.60  07t7 

ELECTROMAGNETIC   SCATTERING  FROM  ASPHERICAL  NON  NUMERICAL 
UNIFORM  MEDIUM.    PART    I.   GENERAL  THEORY 

AD-289  614(K)      12.60  0748 
PROBABILITY    INTEGRALS  OF   MULTIVARIATE   NORMAL   A  NYSTAGMUS 
NO   MULTIVARIATE-T  AD-290  7'.6(K)      18.60  0760 

RESONANCE   ABSORPTION  OF   GAMMA-RAYS    IN  NORMAL  A 
NO  SUPERCONDUCTING  TIN 

AO-289  SAi-IK)      13.60  0826  OAK 
NORMS   FOR  ARTIFICIAL  LIGHTING 

AD-290   555(KI      11.10  0734  OBJECTS 
FACTORS   INFLUENCING  VASCULAR   PLANT  ZONATION  IN 
NORTH  CAROLINA  SALTMARSHES 

AO-290  938(K)      17.60  0603  OBSERVATORY 
SONAR   STUDIES   OF   THE   DEEP   SCATTERING  LAYER  IN 
THE   NORTH   PACIFIC  PB   162  427IK)      12.60  0587  OCEAN 

THE  DEVELOPMENT   OF   RESCUE   AND   SURVIVAL  TECHNIC 
UES   IN   THE   NORTH   AMERICAN  ARCTIC 

PB   162   410IK)    112.00  0085  OCEANOGRAPH IC 

THE   FLORA  OF   HEALTHY   DOGS.    1.    BACTERIA  AND  FUN 
GI   OF    THE   NOSE,    THROAT,    AND  LOWER  INTESTINE 

LF-2(K)      12.60  0458  OCEANOGRAPH I C 

FABRICATION  OF   PYROLYTIC   GRAPHITE   ROCKET  NOZZL 

E   COMPONENTS  PB   162   371(K)      11.10  0351  OC EANOGRAPH I C 

FABRICATION  OF  PYROLYTIC  GRAPHITE  ROCKET  NOZZL 
E   COMPONENTS  PB   162   370(K)      11.10  0353 

FABRICATION  OF   PYROLYTIC   GRAPHITE   ROCKET  NOZZL 

E   COMPONENTS  PB   162   372tK)      12.60  0352  OCEANOGRAPHIC 

THIRD   SYMPOSIUM  ON  ADVANCED  PROPULSION  CONCEPT 
S   SPONSORED  BY  UNITED   STATES   AIR   FORCE  OFFICE 
OF   SCIENTIFIC   RESEARCH   AND  THE   GENERAL  ELECTRI 

C   COMPANY  FLIGHT   PROPULSION  DIVISION  CINCINNAT  OCEANOGRAPHIC 
I,   OHIO  OCTOBER   2-4,    1962.    PLASMA   FLOW   IN   A  MA 
GNETIC   ARC   NOZZLE  AD-290  082(K)      12.60  0147 

HEAT   TRANSFER  AND   PARTICLE   TRAJECTORIES   IN  SOL 

ID-ROCKET  NOZZLES  AD-289  681(KI      15.60  0030  OCEANOGRAPHIC 

DEVELOPMENT   AND   STANDARDIZATION  OF   FORMS   3  AND 
4  OF   THE   NROTC  CONTRACT   STUDENT   SELECTION  TES 
T  AO-290  784(K)      11.10  0201 

EVALUATION  OF  NROTC  AVIATION  INDOCTRINATION  FI  OCEANOGRAPHIC 
ELO  TOURS  FOR  1961-1952 

AD-290  355(K)  11.60  0581 
A   7090  CODE   FOR   THE   CALCULATION  OF   ELECTROMAGN  OCTYL 


ETIC  BLACKOUT  FOLLOWING  A  HIGH  ALTITUDE  NUCLEA 
R  DETONATION  AO-291    141(K)      18.60  0372 

ACCURATE   NUCLEAR   FUEL   BURNOP  ANALYSES 

G6AP-4062(K)      11.60  0362 
APPLICATION  OF   NUCLEAR   POWER   SUPPLIES   TO  SPACE 
SYSTEMS  T10-17306IK)      18.50  0741 

CAROLINAS-VIRGINIA  NUCLEAR  PQhER   ASSOCIATES,  1 
NC,    RESEARCH   AND   DEVELOPMENT   PROGRAM  QUARTERL 
Y   PROGRESS  REPORT   FOR   THE   PERIOD   APRIL  -  JUNE 
1962  CVNA-156(K)      16.60  0839 

COMPUTER  PROGRAMS  FOR  OPTIMUM  START-UP  OF  NUCL 
EAR   PROPULSION  SYSTEMS 

TI0-16730(K)      11.10  0712 
DOSE-TIME-DISTANCE   CURVES   FOR  CLOSE-IN  FALLOUT 
FOR   LOW   YIELD   LAND-SURFACE   NUCLEAR  DETONATION 
S  PB1625i51Klll.50  0573 

EXTRUDED  CERAMIC  NUCLEAR  FUEL  DEVELOPMENT  PROG 
RAM  ACNP-62550(KI      14.60  0092 

FEASIBILITY  DETERMINATION  OF  A  NUCLEAR  THERMIO 
NIC   SPACE   POWER  PLANT 

AD-290  06eiK)  12.60  0031 
HIGH   -   ENERGY   NUCLEAR    PHYSICS    RESEARCH  PROGRAM 

AD-291    140IK)      11.50  0374 
HIGH-ENERGY  NUCLEAR  REACTIONS   OF   NIOBIUM  WITH 
INCIDENT   PROTONS   AND  HELIUK  IONS 

UCRL-10461(KI     12.25  0222 
INVESTIGATIONS  ON   THE   DIRECT   CONVERSION  OF  NUC 
LEAR   FISSION   ENERGY  TO  ELECTRICAL   ENERGY    IN  A 
PLASMA  DIODE  AD-290   727IK)      19.60  0385 

NUCLEAR   SUPERHEAT   DEVELOPMENT   PROGRAM  •' 

GNEC-254(K)    114.00  0386 
PRODUCTION  OF   TRITIUM  BY   CONTAINED  NUCLEAR  EXP 
LOSIONS   IN   SALT.    I.   LABORATORY   STUDIES  OF  ISOT 
OPIC   EXCHANGE   OF   TRITIUM   IN  THE  HYDROGEN-HATER 
SYSTEM  0RNL-3334(K)        1.50  0617 

STRIKING   EFFECT  OF   NUCLEAR  EXPLOSION 

AO-290  824(K)  121.00  0083 
THE   NUCLEAR  PROPERTIES  OF  RHENIUM 

AO-291  1801KI  11.50  0310 
VARIATIONS  IN  THE  TOTAL  ELECTRON  CONTENT  OF  TH 
E  IONOSPHERE  AFTER  THE  HIGH  ALTITUDE  NUCLEAR  E 
XPLOSION  NASA  N63-10485(K)      11.10  0142 

630A  MARITIME   NUCLEAR   STEAM  GENERATOR 

GEMP-150(K)      18.10  0349 
THE   ESTIMATION  PROBLEM   IN  NULL-ZONE  RECEPTION 
FEEDBACK   SYSTEMS  AD-290   325(K)    111.00  0599 

FUNDAMENTAL  SOLUTION  TO  THE  DIFFUSION  BOUNDARY 
LAYER  EQUATION  FOR  NEARLY  SEPARATED  FLOW  OVER 
SOLID  SURFACES   AT  VERY   LARGE   PRANDTL  NUMBERS 

AO-291  031(KI  12.60  0023 
LOCAL  PRESSURE  DISTRIBUTION  ON  A  BLUNT  DELTA  W 
ING  FOR  ANGLES  OF  ATTACK  UP  TO  35-OEGREES  AT  M 
ACH  NUMBERS  OF   3.4  AND  4.7 

NASA  N63-10800(K)  1.75  0516 
A  MAINTENANCE  PROGRAM  FOR  NUMERICAL  CONTROL  SY 
STEMS   ON  MACHINE  TOOLS 

II0-17375IK)      12.60  0809 
A   PRIORI    BOUNDS  ON   THE   S I  SCR E T I Z A T I  ON   ERROR  IN 

THE  NUMERICAL  SOLUTION  OF  THE  DIRICHLET  PROBL 
EM  AD-290   3221K)      14.50  0454 

NON-SIMILAR  NUMERICAL   METHODS   OF    SOLUTION  FOR 
ELECTRODE   BOUNDARY   LAYERS    IN   A  CROSSED  FIELD  A 
CCELERATOR  AD-290  525(K)      15.50  0185 

MANIPULATION  OF  AROUSAL  AND  ITS  EFFECTS  ON  HUH 
AN  VESTIBULAR  NYSTAGMUS  INDUCED  BY  CALORIC  IRR 
IGATION  AND  ANGULAR  ACCELERATIONS 

AD-290  348(K)  11.50  0252 
A  SAFETY  REVIEW  OF  THE  OAK  RIDGE  CRITICAL  EXPE 
RIMENTS  FACILITY  ORNL-TM- 349 1 K )  15.60  0612 
DRAG  OF  OBJECTS  IN  PARTICLE  -  LADEN  AIR  FLOW  P 
HASE  IV.  BLUNT  BODIES  AND  COMPRESSIBILITY  EFFE 
CTS  AD-291   178(K)      15.60  0752 

TONTO  FOREST   SE I SMOLOG I C AL  OBSERVATORY 

AD-291  148(K)  13.60  0815 
A  SAMPLE  TEST  EXPOSURE  TO  EXAMINE  CORROSION  AN 
0  FOULING  OF  EQUIPMENT  INSTALLED  IN  THE  DEEP  0 
CEAN  AO-291   049(K)      11.60  0582 

OCEANOGRAPHIC   CRUISE   TO   THE   BERING  AND  CHUKCHI 

SEAS,    SUMMER   1949.    PART    1   SEA  FLOOR  STUDIES 

PB  162  425(K)  12.60  0585 
OCEANOGRAPHIC  AND  UNDERWATER  ACCOUSTICS  RESEAR 
CH  AO-290  252(K)      12.60  0848 

OCEANOGRAPHIC   CRUISE   TO   THE   BERING  AND  CHUKCHI 

SEAS,  SUMMER  1949.  PART  IV.  PHYSICAL  OCEANOGR 
APHIC   STUDIES.   VOL.    I.   DESCRIPTIVE  REPORT 

PB  162  42e-l(K)  13.60  0584 
OCEANOGRAPHIC  CRUISE   TO  THE   BERING  AND  CHUKCHI 

SEAS,  SUMMER  1949.  PART  IV.  PHYSICAL  OCEANOGR 
APHIC   STUDIES.   VOL.    I.   DESCRIPTIVE  REPORT 

PB  162  428-llK)  13.60  0584 
OCEANOGRAPHIC  CRUISE   TO  THE   BERING  AND  CHUKCHI 

SEAS,  SUMMER  1949.  PART  IV  PHYSICAL  OCEANOGRA 
PHIC   STUDIES.    VOL.   2.    DATA  REPORT 

PB  162  428-2(K)  14.60  0585 
OCEANOGRAPHIC   CRUISE   TO  THE  BERING  AND  CHUKCHI 

SEAS,  SUMMER  1949.  PART  IV  PHYSICAL  OCEANOGRA 
PHIC   STUDIES.    VOL.   2.    DATA  REPORT 

PB  162  428-2(K)  14.50  0586 
PROCEEDINGS  OF  I NTER I  NOUS TR I AL  OCEANOGRAPHIC  S 
YMPOSIUM      (NO.   11,   BURBANK,   CALIFORNIA,    5  JUNE 

1952  PB   162   587{K)      12.60  0451 

RUBBER   ELASTICITY    IN  HIGHLY  CROSSLINKED  SYSTEM 


Figure  6.    Sample,  CEIR  Format  for  Office  of  Technical  Services 
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CHANGES  or  SLTCEhIA   IN  THE  UHglLICAL  VEIN  roLLdHlNS 

INTRAVENOUS  ADHINISTRATION  OF  SLUCOSE  TO  MOTHER.  • 

Z  K  STEhBERA.   J  HODR  •  CESK  STNEk  <ti*  P610>6>   OCT  99  CZ 

NODlFICAtlON  OF   THE  SLTCEHU  LEVEL.   PYRUVIC  ACID  LEVEL  AND  THE 
LEVEL  OF   INORSANIC  PHOSPHORUS  6T  APPLICATION  OF  GLUCOSE  DURING 
LABOR  UITh  CANSIDERaTION  TO  HTPOXIA  OF  THE  FETUS.   •  J  HODR. 
J  HER2H*NN.   J  JANDA  •  Z  GEBURTSH  GTNAEK   visa  fil-li  1999  GER 

EFFECTS  OF  THE  ADHINISTRATION  OF  SULFONAMIDE  BT  HAY  Of  THE 
EXOCRINE  OUCTS  ON  THE  GLTCEHIA   AND  HISTOLOGICAL  STRUCTURE  OF 
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P16B-9>   16  JAN  60 
essentially the original Luhn format, and it should be noted in this connection that while
Luhn recognized that the origin of the KWIC principle lay in the making of concordances,
he claimed in particular the use of machines to achieve speed, completeness, and accuracy,
and a novel format. 1/

The most common alternative to the center position for the indexing window (or keyword
position) is at the left, or beginning, of the line. Netherwood's selected bibliography of
logical machine design, which is probably the first of the modern permuted title indexes
to appear in the open literature, used the left-most positions for the index entry word in
each title listing. Slant marks were also printed to show the breaks in the normal order
of the title (Netherwood, 1958 [437]). A proposed subscription service, advertised in
1958 but never actually brought into operation, would also have used the left-hand
position. 2/

In these left-position examples, the keyword-in-context principle is kept only
partially intact, since the word in the index position is directly adjacent to its most
specific right-hand context, not to its left-hand. In variations such as that developed at
Stanford Research Institute, however, the index word is extracted from its context and
printed separately in the left-hand margin, with the title in its normal order printed to
the right. This type of variation has been called "KWOC", for keyword-out-of-context,
and is illustrated in Figure 6, which shows the format developed by C.E.I.R., Inc. for
the OTS index to U. S. Government Research Reports.
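
The three layouts described above (center window, left-most window with slant marks, and KWOC) can be sketched in a few lines of Python. This is only an illustrative sketch, not a reconstruction of any of the programs cited: the stop list, the column width, the function names, and the choice of a single title are all assumptions made for the example, and punctuation handling is ignored.

```python
# Illustrative stop list (an assumption; real KWIC programs used longer lists).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to"}


def keyword_positions(title):
    """Return the positions of title words that are not on the stop list."""
    words = title.split()
    return [i for i, w in enumerate(words) if w.lower() not in STOP_WORDS]


def kwic_lines(title, width=24):
    """Center window: keyword starts in a fixed column, context on both sides."""
    words = title.split()
    lines = []
    for i in keyword_positions(title):
        left = " ".join(words[:i])[-width:]   # truncate context on the left
        right = " ".join(words[i:])[:width]   # keyword plus right-hand context
        lines.append((words[i].lower(), f"{left:>{width}}  {right}"))
    return [line for _, line in sorted(lines)]


def left_window_lines(title):
    """Left-most window: rotate the title; a slant marks the break in normal order."""
    words = title.split()
    lines = []
    for i in keyword_positions(title):
        wrapped = ["/"] + words[:i] if i else []
        lines.append(" ".join(words[i:] + wrapped))
    return sorted(lines, key=str.lower)


def kwoc_lines(title):
    """KWOC: keyword extracted to the margin, title printed in normal order."""
    words = title.split()
    return sorted((words[i].upper(), title) for i in keyword_positions(title))


TITLE = "Rubber Elasticity in Highly Crosslinked Systems"
for line in kwic_lines(TITLE):
    print(line)
for line in left_window_lines(TITLE):
    print(line)
for keyword, full_title in kwoc_lines(TITLE):
    print(f"{keyword:<14}{full_title}")
```

In the left-window output the keyword is adjacent only to its right-hand context, which is the partial loss of the keyword-in-context principle noted in the text; the KWOC output gives up adjacency altogether in exchange for an unbroken title.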

Table  1  lists  a  number  of  KWIC  index  projects  for  which  computer  programs  are  or 
might  be  made  available  to  interested  additional  users.    Computer  programs  have  been 
written  specifically  for  the  IBM  650,  704,   1620,   709,   7090,  and  7094  data  processing 
systems,  the  G.  E.  225  computer,  the  Deuce  Computer  in  England,  the  UNIVAC  1103  and 
1107  systems,  and  the  Japanese  computer  JEIPAC,  among  others.    In  addition,  some 
permuted  title  indexes  are  produced  manually,  or  with  the  use  of  simple  business  office 
machine  equipment.    For  example,  an  index  to  the  AIBS  Bulletin  for  1951-1961  has  been 
so produced by the American Institute of Biological Sciences. 3/


1/  Private communication, excerpt of letter from H. P. Luhn to C. L. Bernier,
December 27, 1960: "With respect to the origin of the KWIC Index, you are,
of course, right that it is a form of concordance, as stated in my original
paper. Furthermore, keyword indexing has been practiced in various forms
as far back as a hundred years ago. All of these methods were, however,
dependent on manual effort. I would say that the significance of the present KWIC
Index is based on the fact that it is produced automatically by machine, affording
speed of compilation, accuracy and completeness. As far as the particular format
of the Index is concerned, this is novel to my knowledge, in accordance with
information I have been able to ascertain from others."

2/  "PILOT--a permutation index to this month's literature", see p. 8 and Figure 1.
A left-most window full-title format was developed at Stanford University in
cooperation with the IBM San Jose Laboratories. It has been applied by the
Computation Center to the titles of computer programs for the benefit of users of the
Program Library (Computation Center, Stanford University, "The KWIC Index",
1963). See also Marckworth, 1961 [393].

3/  National Science Foundation's CR&D Report No. 11, [430], p. 10; Janaske, 1962
[299]; Shilling, 1963 [550] and [551].


[Table 1.  KWIC index projects and available computer programs (table not legible in this copy)]


In addition to the regularly issued KWIC indexes by Biological Abstracts, Chemical
Abstracts Service, the American Meteorological Society and others, a large number of
special-field, one-time, or limited-collection indexes of this type have been and
are being produced both in the United States and in other countries. Well-known examples
include the programs developed at the Lawrence Radiation Laboratories, University of
California, which simultaneously produce catalog, cross-reference and subject authority
cards, 1/ and the programs developed at the Bell Telephone Laboratories from 1959
onward (Kennedy, 1962 [310]).

Other KWIC indexing efforts cover a wide variety of subject matter. In the field
of law, applications of KWIC type indexing include work on the legislation of the 50 states,
a joint project of the American Bar Foundation and the Bobbs-Merrill Company (Eldridge
and Dennis, 1962 [183], 1963 [182]); the ninth annual edition of the Index to Legal
Theses and Research Projects, July 1962 (Eldridge and Dennis, 1963 [182]); and a
cooperative program between the libraries of the Universities of Kansas and Oklahoma to
prepare an index to the latter's "Space Law" collection. 2/ In 1960, the KWIC Index to
the Science Abstracts of China was prepared for an AAAS Symposium (Henderson, 1961
[263]; Farley, 1963 [192]). At the University of Kansas Library also, the Kansas
Slavic Index is being produced, with coverage of 3,000 articles from more than 200 Slavic
journals. 3/ In the computer technology field, Youden (1963 [659] and [660]) has
compiled KWIC type indexes to both the Journal of the ACM and the Communications of the
ACM, and the Western Periodicals Company offers KWIC indexes to the proceedings of
the Joint Computer Conferences as well as to the proceedings of other conferences and
symposia, including those in the fields of electronics, aerospace and quality control. 4/ A
special-purpose application is the use of a KWIC index in lieu of cross-references in
a revised edition of Current Medical Terminology. 5/

Examples of KWIC indexing projects abroad include work at the Japanese Information
Center of Science and Technology, Tokyo, 6/ an index "of the 'Chemical Titles' type"
at the All-Union Institute for Scientific and Technical Information (VINITI), U.S.S.R., 7/
an information journal for the atomic energy field being prepared at the Gmelin
Institute (Koelwijn, 1962 [330]), and work in Great Britain both at the English Electric
Company 8/ and the IBM British Laboratories (Black, 1962 [65]).

1/  National Science Foundation's CR&D Report, No. 11, [430], p. 42.

2/  Ibid, pp. 44 and 171.

3/  Ibid, p. 43; University of Kansas, 1963 [307].

4/  See advertisements in journals such as American Documentation.

5/  Gordon and Slowinski, 1963 [236], p. 55.

6/  National Science Foundation's CR&D Report, No. 11, [430], p. 120.

7/  Mikhailov, 1962 [418], p. 50.

8/  Dowell and Marshall, 1962 [159], p. 323; Black, 1962 [65], p. 316.
Trans-Canada Air Lines 1/ is using a KWIC System, and at the EURATOM ISPRA
laboratories a KWIC type program has been developed with up to 600-character context and a
left-most indexing position. 2/

3.1.2    Advantages,  Disadvantages  and  Operational  Problems  of  KWIC  Indexing 

Luhn's  original  acronym,  KWIC,  is  peculiarly  apt  for  permuted  title  word  indexing. 
As  both  proponents  and  critics  have  noted,  the  resulting  product  may  be  relatively  crude 
in  terms  of  indexing  quality,  but  it  is  quick.     The  speed  achievable  both  by  elimination  of 
human  intellectual  effort  and  by  use  of  machine  (especially  computer)  processing  is  indeed 
the major single advantage of this type of automatic indexing. Closely related, however,
are the advantages of currency of announcement and the availability of these indexes for
individual use.

Some typical claims with respect to speed and currency are as follows:

"The permuted index was invented as a means of adequately controlling
(essentially, of indexing) the literature without further intellectual effort,
and thus eliminating indexing delays." 3/

"The great merit of this particular method ... is that it enables information
concerning new articles to be made available very much more quickly than if
there were the inevitable delays of human abstracting and indexing." 4/

"In spite of the disadvantages which are pointed out, perhaps the greatest
advantage is the timeliness and the speed with which permuted-title indexes
can be prepared." 5/

Specific examples of high speed are given by Biological Abstracts, where one hour's
computer time suffices to prepare and arrange entries for over 150,000 items. 6/
Kennedy reports for the Bell Laboratories System that:

"Editorial scanning is very fast; only several lines of print must be read for
each report and the required text markings are trivially few. Keypunching,
the largest single task, takes about two minutes per report ... Main-frame time
... was 12 minutes for 1703 reports." 7/

1/  Simons, 1963 [556], p. 34.

2/  Meyer-Uhlenried and Lustig, 1963 [417], p. 229.

3/  Tukey, 1962 [611], p. 13.

4/  Cleverdon, 1961 [125], p. 108.

5/  Janaske, 1962 [299], p. 3.

6/  See Biological Abstracts, 36:24, p. xii.

7/  Kennedy, 1961 [311], p. 123.
Skaggs and Spangler claim:

"The most obvious advantage of permuted indexing by computer is speed. In
a test of one permuted indexing system, input of 3,000 punched cards containing
titles and running text produced a permuted significant word index of
12,190 index entry lines, with approximately 85 minutes of computer time
required for the permuting and sort operations. The output was printed at
some 500 lines per minute ..." 1/
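
The quoted figures can be put on a common per-minute footing with a little arithmetic. The sketch below only restates the numbers given in the passages above; the variable names are illustrative, the inputs are the rounded values as quoted ("over 150,000", "about two minutes", "some 500 lines"), and the three systems ran on different machines and counted different units (items, reports, entry lines), so the rates are not a like-for-like comparison.

```python
def per_minute(count, minutes):
    """Units processed per minute of machine time."""
    return count / minutes


# Biological Abstracts: entries for over 150,000 items in one hour.
biological_abstracts_rate = per_minute(150_000, 60)

# Bell Laboratories (Kennedy): 12 minutes of main-frame time for 1703 reports,
# and about two minutes of keypunching per report.
bell_mainframe_rate = per_minute(1_703, 12)
bell_keypunch_hours = 1_703 * 2 / 60

# Skaggs and Spangler: 3,000 cards in, 12,190 entry lines out, about 85 minutes
# of permuting and sorting, printing at some 500 lines per minute.
permute_sort_rate = per_minute(12_190, 85)
print_minutes = 12_190 / 500
lines_per_card = 12_190 / 3_000

print(f"Biological Abstracts: {biological_abstracts_rate:,.0f} items per minute")
print(f"Bell main frame:      {bell_mainframe_rate:.1f} reports per minute")
print(f"Bell keypunching:     {bell_keypunch_hours:.1f} hours in all")
print(f"Permute and sort:     {permute_sort_rate:.1f} entry lines per minute")
print(f"Printing:             {print_minutes:.1f} minutes")
print(f"Entry lines per card: {lines_per_card:.2f}")
```

On these figures the keypunching of input, not the machine run, dominates the elapsed time in the Bell system, which is consistent with Kennedy's description of keypunching as the largest single task.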

In many cases, greater speed and timeliness are achieved at significantly lower
cost. This is particularly true if the preparation of the input--title, author, item
identification and other descriptive cataloging information--serves multiple purposes
from a single keystroking operation. Thus, the MATICO System provides from a single
input (1) KWIC indexes as required, (2) selective dissemination notices to potential users
of new acquisitions, (3) records on magnetic tape for the information retrieval file, and
(4) book catalogs covering specialized areas of the collection, all at a net savings over
previous methods of $0.39 for each title processed. 2/

Another advantage which is typically claimed for KWIC indexes is the use of the
author's own terminology. The display of different words as they have been used in
title context with any word looked up introduces "suggestiveness", so that different
meanings and different browsing clues are shown. Kennedy makes the following typical points:

"The use of the author's own terms--the alive currency of new ideas--rather than
the considered reshapings to the indexing system may often be of advantage. The
automatic generation as index entries of all the separate words in multi-term
concepts is definitely so. Access is direct, under any one of the component
terms, in the unrestricted manner of Uniterm indexing. And context minimizes
false drops; the author has supplied the term coordination." 3/

Others,  however,   consider  some  of  these  same  factors  to  be  definite  disadvantages. 

In  general,  even  among  enthusiasts  of  KWIC,  there  is  more  agreement  as  to  the 
values  of  the  technique  as  a  device  for  current  awareness  scanning  and  as  a  dissemination 
index  than  for  its  use  for  more  extensive  searching.    It  was,  in  fact,  primarily  as  a 
dissemination  index  that  Luhn  first  proposed  the  KWIC  technique.    He  pointed  out  that 
such  indexes  could  be  prepared  with  minimum  effort  and  be  ready  for  dissemination  in 
the  shortest  possible  time,  justifying  publication  by  inexpensive  printing  means.    He  also 
noted  the  following  additional  advantages: 

1/  Skaggs and Spangler, 1963 [557], p. 30.

2/  Carroll and Summit, 1962 [102], p. 4.

3/  Kennedy, 1962 [310], p. 184.
"1.  Because of the mechanical method of preparation, more information
may be displayed than would have been practicable by conventional
means.

"2.  Keywords-in-context permit the cross-correlation of subjects to an
extent not realizable by conventional procedures." 1/

The most common type of complaint against the KWIC indexing method is, as we
have noted earlier, identical with that which is applied to word indexing in general--the
lack of terminological control. Where the indexing terms are restricted to those used by
the author himself, in his title or even full text, there arise many serious problems of
synonyms, near-synonyms, homographs, neologisms, and eponyms. The effects of
machine inability to resolve these problems are redundancy, scatter of references
throughout the index, "haphazard groupings", 2/ and retrieval losses because the user is
forced to guess at the terminology the author actually used. 3/ These problems are
severely aggravated when only the title is used as the basis for index-word extraction.

Thus,  a  first  and  major  question  in  attempting  to  appraise  the  effectiveness  of  KWIC- 
indexing  techniques  is  that  of  the  adequacy  of  titles  alone  as  the  source  of  subject  content 
clues.    Spurred  on  at  least  in  part  by  the  existence  of  KWIC-type  indexes,  several 
investigators  have  studied  this  question,  with  somewhat  different  results.    Williams  has 
explored  for  some  years  the  possibilities  of  developing  systematic  procedures  for  title 
elaboration,  especially  making  explicit  information  that  is  implied.    Her  conclusions  are 
that  indexing  by  title  and  direct  elaboration  of  the  title  would  produce  index  information 
equivalent  to  that  found  in  Chemical  Abstracts  for  about  50  percent  of  the  documents 
studied, but that other procedures would be required for the remainder. 4/

Specific studies of title adequacy for a particular journal or field have been
undertaken by both the American Institute of Physics and the Biological Sciences
Communications Project. In the A.I.P. experiments, graduate physics students were asked to
locate from limited clues certain specific articles appearing in The Physical Review, and
search times were checked for their use of permuted title and other indexes. Another
group of students compared the subject index entries in Physics Abstracts and Chemical
Abstracts with the words in the titles of 25 papers from The Physical Review. In the case
of Physics Abstracts, 69 percent of the entries for these papers were found in the words
of the title and 63 percent of the titles contained all of the information supplied by the
set of index entries. In the case of Chemical Abstracts, the corresponding percentages
were 47 and 23. 5/ These latter findings, for the chemical index, are closely corroborated

1/  Luhn, 1959 [381], p. 295.

2/  Olney, 1963 [458], p. 44.

3/  See, for example, Dowell and Marshall, 1962 [159], p. 324: "This problem of
'conceptual scatter' becomes a nightmare when highly idiosyncratic author
language is used as a basis for subject indexing."

4/  Williams, 1961 [643], pp. 361-363.

5/  Maizell, 1960 [392], p. 126.
by Bernier and Crane, who report that for the non-organic chemistry items covered by
Chemical Abstracts, 34 percent of the entries can be derived from the titles. 1/

With  respect  to  the  Biological  Sciences  Communications  Project  studies,  Shilling 
reports  as  follows: 

"Titles  of  scientific  articles  are  being  utilized  at  present  in  a  great  many  ways 
under  the  general  assumption  that  there  is  a  positive  correlation  between  the 
title  and  the  content  of  the  article.    A  study  was  undertaken  to  analyze  the 
accuracy  of  titles  in  describing  the  content  of  biomedical  articles.    It  was  conducted 
in  two  parts.    In  part  one,  a  group  of  scientists  were  asked  to  predict  the  content 
of  selected  scientific  articles,  in  their  area  of  interest,  from  the  title,  the  author's 
name,  and  the  name  of  the  journal  in  which  it  appeared.    The  results  of  the  first 
phase  of  the  study  on  the  first  trial  journal  were  so  diverse  as  to  make  analysis 
impossible,  and  this  part  of  the  study  was  not  pursued  further.    From  this  small 
segment  of  the  study  it  appears  that  scientists  are  deluding  themselves  when  they 
search  by  title  only  and  then  decide  what  they  wish  to  read. 

"In  the  other  half  of  this  experiment,  the  article  without  title,  author's  name,  or 
journal  name  was  sent  to  20  scientists,   selected  as  experts  in  the  scientific  field 
of  the  article,  who  were  asked  to  write  a  meaningful  title.    Fifty  articles  were 
used,  five  from  each  of  ten  selected  biomedical  journals.    From  this  part  of  the 
study  it  is  apparent  that  if  the  article  is  in  a  field  which  is  relatively  well 
standardized and has an accepted vocabulary, it is possible for a group of titlists to
agree  remarkably  well  on  an  appropriate  title.    However,  if  the  article  is  loosely 
organized,  contains  more  than  one  subject,  or  is  in  a  specialty  in  which  there  is 
no  standard  vocabulary,  then  titling  scientists  fail  to  agree  to  a  rather  alarming 
extent." 2/

Other  studies  involving  the  question  of  usefulness  of  titles  alone  for  indexing  purposes 
include  those  of  Doyle,  Lane,  Montgomery  and  Swanson,  O'Connor,  Ruhl,  Swanson,  and 
White  and  Walsh,  among  others.    Doyle  checked  the  retrieval  loss  likely  to  result  from 
the  synonymity- scatter  problem  for  a  permuted  title  index  compiled  in  1958  to  the  internal 
reports  of  the  System  Development  Corporation.    He  found,  for  example,  that  for  12 
direct references to McGuire Air Force Base, there were one to "New York Air Defense
Sector", two to "New York Sector", ten to "NYADS" and five to "N. Y. Sector". 3/
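
Doyle's counts give a rough measure of the retrieval loss. The sketch below assumes, as the passage implies, that all five forms designate the same subject, so that a searcher who looks under only some of the forms misses the references filed under the others. The function name and the notion of "share found" are illustrative conveniences, not Doyle's own measure.

```python
# Reference counts by title wording, as reported by Doyle.
REFERENCES = {
    "McGuire Air Force Base": 12,
    "New York Air Defense Sector": 1,
    "New York Sector": 2,
    "NYADS": 10,
    "N. Y. Sector": 5,
}


def share_found(counts, forms_searched):
    """Fraction of all references located by looking under the given forms only."""
    total = sum(counts.values())
    found = sum(counts[form] for form in forms_searched)
    return found / total


total = sum(REFERENCES.values())
direct_only = share_found(REFERENCES, ["McGuire Air Force Base"])
with_acronym = share_found(REFERENCES, ["McGuire Air Force Base", "NYADS"])

print(f"{total} references in all")
print(f"direct form only:   {direct_only:.0%} found")
print(f"direct plus acronym: {with_acronym:.0%} found")
```

Under this assumption the direct form accounts for 12 of 30 references, so a search confined to it would locate only two fifths of the relevant entries.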
1/  Bernier and Crane, 1962 [56], p. 120.

2/  Shilling, 1963 [551], pp. 205-206.

3/  Doyle, 1961 [166], p. 11.
Ruhl (1963 [506]) found that between 50 and 90 percent of author-prepared titles (the
variation depending on subject field and other circumstances) did fully reflect the index
terms assigned to these documents by human indexers. Lane and White and Walsh have
also made studies directly related to the question of KWIC index effectiveness. The latter
two investigators report only 52 percent retrieval effectiveness for a permuted title index
to the Abstracts of Computer Literature, 1962, which they attribute to the changing
terminology in the still new field of computer technology. 1/ Lane made counts of titles
that would be "acceptable" and those that would not for a KWIC index for 50 titles drawn
from each of 10 published indexes. He concluded that, if there were judicious pre-editing,
technical articles in the technical subject indexes could be quite adequately covered, and
papers in the fields of law, business, and the humanities somewhat less satisfactorily so,
but that for the material indexed in the Reader's Guide to Periodical Literature, the KWIC
technique would fail 58 percent of the time. 2/

Montgomery and Swanson have studied, as has O'Connor in even more detail, the
adequacy of "machine-like indexing by people". Montgomery and Swanson took as their
test corpus the September 1960 issue of Index Medicus and found that for 4,770 items,
85.8 percent contained either the word itself or a synonym for the subject heading
assigned, slightly over 11 percent did not, and in the remaining cases the investigators
could not clearly decide. They concluded, therefore, that: "Most of the articles studied
could have been indexed by machine on the basis of machine 'inspection' of article titles
alone." 3/ O'Connor, however, typically reports that of a random sample of 50 papers
manually indexed under the term "Toxicity", five had titles which contained the word
"toxic" or the word "toxicity" and 34 had titles which were not even indirectly connected
with the term ([443], [444], [445], [447] and [448]). With respect to the Montgomery-
Swanson conclusions as such, Carlson raises the further critical questions of
over-assignment and false drops and suggests that: "a simple machine processing of titles
would give us way too much or practically nothing." 4/
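
The kind of title "inspection" at issue in these studies can be sketched as a three-way tally: the title contains a word of the assigned subject heading, it contains a listed synonym, or it contains neither. The sketch below is an assumption-laden illustration and not the procedure of any of the investigators cited; the sample titles and the synonym table are invented for the example, and a working system would need the adequate synonym dictionary or thesaurus that such a tally takes for granted.

```python
import re
from collections import Counter

# Hypothetical synonym table: subject heading -> title words treated as equivalent.
SYNONYMS = {"Toxicity": ["toxic", "poisoning"]}

# Hypothetical sample: (title, subject heading assigned by a human indexer).
SAMPLE = [
    ("Toxicity of Chlorinated Hydrocarbons in Rats", "Toxicity"),
    ("Toxic Effects of Lead Compounds", "Toxicity"),
    ("Liver Damage Following Exposure to Carbon Tetrachloride", "Toxicity"),
]


def title_words(text):
    """Lower-cased alphabetic words of a title or heading."""
    return set(re.findall(r"[a-z]+", text.lower()))


def match_kind(title, heading, synonyms):
    """Classify a title as matching the heading by 'word', 'synonym', or 'none'."""
    words = title_words(title)
    if title_words(heading) & words:
        return "word"
    if any(title_words(s) <= words for s in synonyms.get(heading, ())):
        return "synonym"
    return "none"


tally = Counter(match_kind(title, heading, SYNONYMS) for title, heading in SAMPLE)
for kind in ("word", "synonym", "none"):
    print(f"{kind:<8}{tally[kind]}")
```

A tally of this sort measures only whether the assigned heading could have been found in the title. It says nothing about Carlson's objection, which concerns the headings a machine would also assign wrongly when the same words occur in titles on other subjects.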

Research activities at the American Bar Foundation have included checking of
KWIC type indexing of several thousand legal articles with the subject headings assigned
under the "Index to Legal Periodicals" system (Kraft, 1962 [333]). It is reported that:

1/  White and Walsh, 1963 [639], p. 346.

2/  Lane, 1964 [345], p. 46.

3/  Montgomery and Swanson, 1962 [421], p. 359. In another study (1962 [534], p. 468),
Swanson reports findings for several thousand entries in classified bibliographies
where approximately 90 percent of the sampled items contained title words that were
identical, or similar in meaning, to the subject headings under which they were
indexed. He notes, however, that similar results could have been produced by
machine processing with the significant proviso that the machine have available an
adequate synonym dictionary or thesaurus.

4/  G. Carlson, 1963 [100], pp. 328-329.
"Interpretation of data revealed, among other things, that 64.4 percent of the
title entries contained as keywords one or more of the ILP subject heading words
under which they were indexed, and 25.1 percent contained logical equivalents.
The remaining 10.5 percent of the title entries had non-descriptive titles." 1/

The difficulties with titles as sources of the indexing information stem from at least
three distinct types of determining factors: (1) the language habits, background,
interests, and idiosyncrasies of the author; (2) the interests, familiarity with the subject
matter, language habits, imagination, and idiosyncrasies of the user; and (3) factors
largely extrinsic to either the particular author or the particular user. In the first case,
we find especially the problem of the witty, punning, deliberately non-informative title,
the so-called "pathological title". Janaske gives the provocative example, in the literature
of information selection and retrieval itself, of "The Golden Retriever". 2/ Even in the
non-pathological case, however, there is the serious question of whether the author
himself is likely to be a good indexer. 3/

On  the  user  side,     the  normal  critical  problems  of  "bringing  the  vocabulary  of 
indexer  and  searcher  into  coincidence"  (Bernier,   1953  [  55])  are  aggravated  by  the  facts 
that  the  user  of  KWIC  must  anticipate  the  terminology  used  by  a  large  number  of 
different  "indexers"  (i.e.,  the  authors),  that  title  words  spelled  the  same  but  with  quite 
different  meanings  in  different  special  applications  are  grouped  together  in  the  same 
place  in  the  index,  and  that  the  same  concepts  may  be  expressed  in  quite  different 
phraseology  depending  on  the  author's,   rather  than  the  user's,  field  of  specialization.  To 
these  aggravating  circumstances  there  must  be  added  in  turn  the  psychological  accept- 
ability to  the  individual  user  of  the  scatter  and  redundancy,  to  say  nothing  of  the  format 
and  legibility,  of  a  particular  published  index. 

Such factors affecting the particular user will of course vary with the nature and
purpose of his search. Kennedy points out, for example, that the location of a document from
only a single clue, a single title word, is particularly easy with a permuted title index
and he emphasizes that the "index purpose, use, size, statement and array are other
factors of considerable moment in judging the value of title indexes". 4/


National  Science  Foundation's  CR&D  Report  No.  11,  [430],  p.  62. 
Janaske,   1962  [299],  p.  4. 

3/

See,  for  example,  a  report  on  a  conference  on  better  indexes  for  technical  literature, 
ASLIB  Proceedings,   13:4,  April  1961,  with  a  number  of  statements  on  the  author  as 
a  poor  indexer.    See  also  Crane  and  Bernier,   1958  [144],  p.  515:    "Not  even  authors 
are  qualified  to  index  their  own  work  unless  they  are  equipped  for  the  task  by  train- 
ing and  experience". 

Kennedy, 1961 [311], p. 125.
A major question in the area of user acceptability, however, is that of the adequacy
of title alone to tell the searcher whether or not a specific document is relevant to his
query or interest. A number of investigators, both documentalists and user-scientists,
suggest that this is rarely the case. 1/ In fact, for many users, titles alone provide only
a negative searching device--in an announcement bulletin or abstract journal the user's
scanning of titles merely tells him whether or not he should read the abstract and then
perhaps go on to the paper itself.

It is for reasons of this type, in all probability, that Montgomery and Swanson found
less effectiveness of titles on relevance-judgment tests than might be suggested by their
more optimistic findings as to the success of machine procedures for replicating human
subject heading assignments. Whereas they have claimed that about 90 percent of test
items could have been as successfully indexed by machine as by manual procedures
(Montgomery and Swanson, 1962 [421]; Swanson, 1962 [584]), they have also reported
that: "Comparison of title relevance judgment with judgment based on full text
examination indicates that titles are only about one-third effective (i.e., two-thirds of the
relevant articles would be judged irrelevant) as the basis for estimating the relevance of the
article to a given question". 2/ They go on to suggest, therefore, that "...indexing should
be based on more than titles and...a bibliographic citation system should present to the
requester something more than titles." 3/ Similarly, Jahoda reports in an analysis of 281
actual search requests at Esso Research and Engineering that only two-thirds could have
been answered with a shallow index based on titles and major section headings of the
documents and that answering the remainder of the requests would have required an index
of considerable depth. 4/

The obvious factors affecting the utility of titles as the source of indexing-searching
clues  include,  first,  the  limitation  of  most  titles  to  the  principal  subject  matter,  the  main 
topic  or  topics  of  the  document.     The  display  of  title  context  does  to  some  extent  provide 
for  modifications  of  the  topic  to  the  special  aspects  treated,  but  it  is  of  course  obvious 
that  a  title  cannot  possibly  provide  clues  to  subject  content  not  implied  in  the  words  of  that 
title.    In  many  cases,  the  potential  user  wants  information  contained  in  the  paper,   or  even 


See, for example, Atherton and Yovich, reporting on evaluations by physicists of
experimental citation indexing, 1962 [26], p. 22: "The reliance on titles of papers
for retrieval purposes was not sufficient"; Levery, 1963 [359], p. 235: "Titles are
usually insufficient to furnish a correct index to the text"; H&cken, 1962 [274], p. 93:
"The titles were not explicit enough"; Crane and Bernier, 1959 [145], p. 1053:
"Lists of titles can be prepared rapidly, but they are inadequately useful in selecting
articles of interest, and they provide little or no directly usable information";
Dowell and Marshall, 1962 [159], p. 324: "Frequently titles either lack sufficient
detail or are in fact misleading"; Connolly, 1963 [136], p. 35: "Most titles are
inadequate as descriptions of the contents of papers."

Montgomery and Swanson, 1962 [421], p. 364.

Ibid,  p.  366. 

Jahoda,   1962  [298],  p.  75. 
in its appendices, which was not the principal concern of the author and may not even have
been  considered  significant  by  him.    The  claim  that  the  author,  who  knows  his  own  subject 
best,  has  already  indexed  his  paper  best  by  his  choice  of  words  and  emphasis  in  text,  and 
especially  in  his  title,  is  pertinent  only  to  that  main  subject  to  which  he  addresses  himself, 
not  to  the  other  potentially  useful  information  which  he  may  also  disclose. 

Other  extrinsic  factors  affecting  title  adequacy  and  hence  the  effectiveness  of  title- 
indexes  are  the  size  and  the  relative  homogeneity  or  heterogeneity  of  the  collection  or  set 
of  documents  so  indexed,  the  breadth  or  narrowness  of  the  subject  field  or  fields  covered, 
the  time  period  covered  and  whether  for  one  or  many  fields.    Whether  or  not  material  in 
more  than  one  language  is  included  is  a  special  factor.     These  various  factors  interact  in 
various  ways,  usually  with  disadvantageous  effects  when  even  the  most  "nondescript" 
human indexer (that is, one who accepts only words from the text itself) is replaced by
"a keypunch operator whose job it is to convert the keywords into machine-readable form,
and a machine whose job it is to assimilate machine-readable text and print out its
permutations with each significant word serving as an access point." 1/

The  difficulties  of  subject  scatter,    synonymy,  homography,  redundancy,  and  the 
like,  however,  will  also  occur  in  human  indexing  that  relies  heavily  on  title  only,  which 
is perhaps more frequently the case than is generally recognized, 2/ just as much as for
machine-generated  indexes  involving  the  permutations  of  keywords  in  titles.    Such  dis- 
advantages must  therefore  be  balanced  not  only  against  the  advantages  of  speed,  timeliness, 
having  an  index  announcement  tool  personally  available  at  low  cost,  and  the  like,  but  also 
against  the  probability  of  obtaining  as  useful  a  tool  within  the  limits  of  available  human 
indexing  resources  and  justifiable  costs.    Cleverdon,  for  example,  comments  as  follows: 

"There  are  those  who  would  say  that  this  [KWIC]  can  in  no  way  be  called  indexing, 
and  that  the  value  of  such  indexing  must  be  very  much  lower  than  that  done  by 
intelligent  trained  human  beings.     This  is  a  comfortable  thought,  but  such  small 
evidence  as  is  at  present  available  makes  it  appear  doubtful  as  to  whether  it  is 
entirely  true.     This  is  not  to  say  that  a  human  being  cannot  do  a  better  job,  but  it 
certainly  appears  likely  that  the  cost  of  employing  a  human  being  to  do  it  is  of 
doubtful  economic  value.  "  3/ 


1/

Herner, 1962 [266], p. 4.

2/

See, for example, Moss, 1962 [425], p. 39: "I am convinced that a great many of
the UDC and other numbers which are provided on millions of cards in technical
libraries up and down the country, and which look so erudite, are, in fact, no more
than cards transliterating titles, with occasionally similar transliteration of a few
randomly chosen words from the abstracts as well... We are, in effect, already
largely using title indexing and complicating it unnecessarily by magic numbers."
See also Crane and Bernier, 1958 [144], p. 514: "Some indexes to periodicals,
particularly word indexes, are merely indexes of titles of papers or of abstracts."

Cleverdon, 1961 [125], pp. 107-108.
It is also of interest to note, moreover, that the very existence of machine-generated
permuted title indexes should greatly increase the likelihood that authors will use better
and more useful titles. 1/ At a seminar on word and vocabulary byproducts of permuted
title indexing held at Biological Abstracts headquarters on October 8, 1962, Rigby of
Meteorological  and  Geoastrophysical  Abstracts  reported  informally  that  as  of  that  time 
there  was  already  discernible  improvement  in  titles  covered  by  their  KWIC  index.    In  the 
same  year  (1962),   Tukey  similarly  stated  that:    "Chemical  Titles  has  been  heavily  enough 
used  to  affect  the  construction  of  titles  of  papers  on  chemical  subjects.  "  2/  Instructions 
to  authors  of  the  previously  mentioned  "Short  Papers"  3/  for  the  A.D.I.   1963  Annual 
Meeting  specified  that  at  least  six  significant  words  should  be  included  in  their  titles  and 
nearly  all  authors  did  in  fact  comply.    Two  of  the  "Short  Papers"  are  specifically  directed 
to  the  topic  of  improvements  that  authors  can  make  in  writing  their  titles  (Brandenberg, 
1963  [80];  Kennedy,   1963  [312]). 

Instructions  of  this  type  can  be  effectively  used  for  situations  where  all  authors  are 
under  the  same  administrative  control,  as  in  the  internal  reports  prepared  in  a  single 
organization.    This  type  of  situation,  incidentally,  is  one  for  which  KWIC  proponents  are 
often most enthusiastic (Kennedy, 1962 [310]; Black, 1962 [65]; Linder, 1960 [362]).
Finally, there is considerable promise that pressures brought to bear by journal editors
of the publications of professional societies, notably the American Institute of Chemical
Engineers and other cooperating member societies of the Engineers Joint Council, will
result in improved adequacy of titles and thereby increased effectiveness of title word
indexes.

Certain other disadvantages of KWIC indexing techniques, however, relate
specifically to operational problems and requirements in the machine production of these indexes.
There is, first, the problem of the amount of context that is usually displayed--that is, the
question of line length--and the related problems of title truncation and wrap-around. As
Kennedy notes: "Progressive shifting of the title to bring a given word to the indexing
column frequently causes portions of the title to exceed the line space available, first at
the right margin, then the left, or even both simultaneously." 4/ A case in point is the
perhaps apocryphal "EROTIC TENDENCIES AMONG TRAPPIST MONKS" where
"ATHEROSCL" had been dropped off at the left.
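The left-margin truncation in this anecdote can be reproduced in a few lines. The following is only a minimal sketch and not any of the production KWIC programs discussed here: the function name `kwic_line`, the field widths, and the simplification of discarding the overflow (rather than wrapping it around to the opposite margin) are assumptions made for illustration.

```python
def kwic_line(title, keyword, left_width, right_width):
    """Place `keyword` at the indexing column and chop whatever overflows.

    `left_width` is the number of characters available to the left of the
    indexing column; `right_width` is the space for the keyword and what
    follows it. Overflow is simply dropped (no wrap-around).
    """
    start = title.index(keyword)
    # Context preceding the keyword, chopped at the left margin.
    left = title[:start][-left_width:] if left_width > 0 else ""
    # Keyword plus following context, chopped at the right margin.
    right = title[start:][:right_width]
    return left.rjust(left_width) + right


title = "ATHEROSCLEROTIC TENDENCIES AMONG TRAPPIST MONKS"

# Indexing on MONKS with only 33 characters of left context loses the
# first nine letters of the title.
print(kwic_line(title, "MONKS", left_width=33, right_width=10))
# prints: EROTIC TENDENCIES AMONG TRAPPIST MONKS
```

A wider left field, or a wrap-around of the lost characters to the right margin, would avoid the misleading fragment at the cost of a longer printed line.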

For multi-column KWIC indexes, in particular, where the line length is typically
58-60 characters, "much of the relevance is lost because the reader sees the wrong slice
of the title". 5/ The Bell Laboratories KWIC index, 6/ Chemical-Biological Activities, 7/


1/ 

See for example, Black, 1962 [65], p. 317; Youden, 1963 [658], p. 332.
Tukey, 1962 [611], pp. 9-10.
Luhn, 1963 [376] and [377].
Kennedy, 1961 [311], p. 117.
Brandenberg, 1963 [80], p. 57.

6/

Kennedy, 1961 [311], p. 118.

7/

Figures 4 and 5.
and Youden's indexes to ACM papers (1963 [659] and [660]) illustrate single-column
formats that alleviate this problem by extending the title line to 103-106 characters,
exclusive of the identification code. Youden has calculated that for the titles in the field of
computer literature which he analyzed 30 percent of the titles would have been truncated
in 60-character title line formats, but that only 2 percent would have been chopped by
103-character title length limits. 1/
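A calculation of the kind attributed to Youden can be sketched as follows. This is an illustration only: it assumes that a title counts as truncated whenever its length exceeds the title line width, and both the function name `truncated_fraction` and the sample titles are hypothetical stand-ins, not Youden's data.

```python
def truncated_fraction(titles, line_width):
    """Fraction of titles too long to fit on a title line of `line_width` characters."""
    if not titles:
        return 0.0
    too_long = sum(1 for t in titles if len(t) > line_width)
    return too_long / len(titles)


# Hypothetical titles of 40, 70, 90 and 110 characters.
sample_titles = ["A" * 40, "B" * 70, "C" * 90, "D" * 110]

for width in (60, 103):
    share = truncated_fraction(sample_titles, width)
    print(f"{width}-character line: {share:.0%} of titles truncated")
```

Run over a real title file, the same tally at several candidate widths gives a direct basis for choosing between multi-column and single-column formats.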

A second disadvantageous effect of machine production requirements in most KWIC
indexes is the tedious sequential scanning necessary because of the unbroken organization
of the page format and the long blocks that occur for frequently occurring word entries.
Doyle (1959 [168], 1961 [166]) has investigated this problem of block length and suggests
either that alphabetization be carried out to the words following those in the indexing
window or that the entries in the block be permuted also in a second-order cycle. The
latter suggestion has the advantage of facilitating any two-term coordinate indexing
type of search, "because one can now look up directly any pair of subject words,
regardless of whether or not they occur adjacently in a sentence." 2/
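The second-order cycle can be pictured as an index keyed on ordered pairs of keywords rather than on single keywords. The sketch below is one possible reading of that idea, not Doyle's program; the function name `second_order_index`, the whitespace tokenization, and the sample titles and stop words are all assumptions for illustration.

```python
from itertools import permutations


def second_order_index(titles, stop_words):
    """Map every ordered pair of distinct keywords in a title to the titles containing both."""
    index = {}
    for doc_id, title in enumerate(titles):
        keywords = sorted({w for w in title.lower().split() if w not in stop_words})
        # Each pair is entered in both orders, so either word can be looked up first.
        for first, second in permutations(keywords, 2):
            index.setdefault((first, second), []).append(doc_id)
    return index


titles = [
    "automatic indexing of legal literature",
    "statistical association in automatic indexing",
    "legal aspects of patents",
]
pairs = second_order_index(titles, stop_words={"of", "in"})

# "legal" and "indexing" are not adjacent in title 0, yet the pair is found directly.
print(pairs[("legal", "indexing")])      # prints: [0]
print(pairs[("automatic", "indexing")])  # prints: [0, 1]
```

The price of the direct two-term lookup is bulk: a title with k keywords yields k(k-1) pair entries instead of k single entries.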

Redundancy in KWIC indexes, which aggravates the sequential scanning and the
long-block fatigue effects, is in large part the result of difficulties in establishing the most
appropriate bounds for exclusion or "stop" lists. We have previously distinguished
machine-generated indexes of the derivative type from certain of the machine-compiled
indexes primarily on the basis that in the first case, the criteria for determining the
significance of the keywords to be used as the index access points are applied
automatically during the machine processing, even if the selectivity so achieved is only
"negative selectivity." 3/ The amount of index entry redundancy, of too many entries
and of irrelevant entries is, in simple KWIC indexing, a direct function of the length and
contents of the stop list.
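The dependence of index size on the stop list is easy to demonstrate. The following minimal sketch assumes the simplest possible "negative selectivity": every title word not on the stop list becomes an entry. The function name `kwic_entries`, the sample titles, and the two stop lists are hypothetical.

```python
def kwic_entries(titles, stop_list):
    """One (keyword, title) entry for every title word not on the stop list."""
    entries = []
    for title in titles:
        for word in title.lower().split():
            if word not in stop_list:
                entries.append((word, title))
    return sorted(entries)


titles = [
    "a report on the analysis of titles",
    "the theory of indexing",
]
function_words = {"a", "on", "the", "of"}
extended_list = function_words | {"report", "analysis", "theory"}

for name, stops in [("no stop list", set()),
                    ("function words only", function_words),
                    ("extended list", extended_list)]:
    print(f"{name}: {len(kwic_entries(titles, stops))} entries")
# prints 11, 5 and 2 entries respectively
```

Each word added to the list removes every line that would have been filed under it, which is why both the length and the contents of the list matter.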

In  Luhn's  original  proposals  for  both  KWIC  and  other  types  of  automatic  indexing, 
he  pointed  out  the  importance  of  the  rules  which  must  be  established  in  order  to 
differentiate  the  significant  words  from  the  nonsignificant.    He  says,  for  example: 

"Since  significance  is  difficult  to  predict,  it  is  more  practicable  to  isolate  it 
by  rejecting  all  obviously  nonsignificant  or  'common'  words,  with  the  risk  of 
admitting  certain  words  of  questionable  value.    Such  words  may  subsequently  be 
eliminated or tolerated as 'noise'. A list of non-significant words would include
articles, conjunctions, prepositions, auxiliary verbs, certain adjectives, and words
such as 'report', 'analysis', 'theory', and the like." 4/

W. W. Youden, 1963 [658], p. 331.

2/

Doyle, 1961 [166], p. 13.
Artandi, 1963 [20], p. 15.
Luhn, 1959 [381], p. 289.




Interesting variations are to be noted in the current practices of using stop lists.
Some lists are quite short, and others extend to several thousand words. Parkins reports
that a mere 14 words on the stop lists used for B.A.S.I.C. are responsible for 80 percent
of the title lines that need not be printed, but that their original list of 200 stop words grew
quite rapidly to more than 1,000 now in use. 1/ Chemical Abstracts Service representatives
reported in 1962 an initial list of about 1,000 words which dropped to 300 at one time and
then was increased again to the original level. 2/ Using a stop list of 82 words eliminated
30 percent of a 42,000-word corpus of internal reports at the System Development
Corporation (Olney, 1961 [456]).

Critical  questions  in  the  establishment  of  stop  lists  relate  to  the  problem  of  balancing 
the  economics  of  the  number  of  title  lines  to  be  printed  and  to  be  subsequently  scanned 
against  the  loss  of  retrieval  effectiveness  if  certain  words  are  omitted  from  the  search 
entry  positions.    How  this  balance  should  be  achieved  may  vary  from  one  subject  field  to 
another  and  between  different  organizations.    In  several  regularly  published  KWIC  indexes, 
the  actual  list  used  to  exclude  the  presumably  nonsignificant  words  is  printed  so  that  the 
user  can  check  before  proceeding  to  actual  search.    Williams  has  suggested  that  each 
excluded  word  be  listed  once,  in  its  proper  alphabetic  place  in  the  index,  if  it  occurs  in 
the titles of the particular set of items being indexed. 3/

In  general,  however,  not  enough  is  yet  known  about  the  requirements  of  particular 
subject  fields  and  particular  types  of  organization  to  arrive  at  the  most  effective  compro- 
mises in  establishing  exclusion  lists  for  keyword  indexing.    Noting  that  stop  lists  in 
actual  use  vary  from  only  a  few  function  words  such  as  prepositions  and  conjunctions  to 
lists  several  hundred  words  long,  Brandenberg  points  out  that: 

"At  the  present  state  of  the  KWIC  indexing  art  the  selection  of  stop  words  appears 
to  be  largely  arbitrary  and  a  comparison  of  half  a  dozen  stop  lists  shows  that  they 
have about two dozen words in common." 4/

Kennedy and Doyle both specifically suggest that more research on the contents and
effects of stop lists is necessary (Kennedy, 1961 [311], 1962 [310]; Doyle, 1963 [162]),
but Kennedy points out the ease with which the machine programs themselves can be used
for modification of the lists. 5/
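A comparison of the kind Brandenberg describes, finding the words common to several stop lists, amounts to a set intersection. The sketch below is illustrative only; the function name `common_stop_words` and the three short lists are invented for the example and do not reproduce any published stop list.

```python
def common_stop_words(stop_lists):
    """Words that appear on every one of the given stop lists."""
    sets = [set(words) for words in stop_lists]
    return set.intersection(*sets) if sets else set()


# Three hypothetical stop lists from different organizations.
lists = [
    ["the", "of", "and", "report"],
    ["the", "of", "a", "analysis"],
    ["of", "the", "in"],
]
print(sorted(common_stop_words(lists)))  # prints: ['of', 'the']
```

The words outside the common core are the ones whose exclusion reflects local judgment, and they are therefore the natural subjects for the further research suggested above.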


1/ 

Parkins,   1963  [466],  p.  27. 

2/

F. A. Tate, discussions at seminar on the word and vocabulary byproducts of
permuted title indexing, Biological Abstracts headquarters, October 8, 1962.

T. M. Williams, discussions at seminar on word and vocabulary byproducts of
permuted title indexing, Biological Abstracts headquarters, October 8, 1962.

Brandenberg, 1963 [80], p. 57.

See also Clark (1960 [123], p. 459), who suggests: "It is very probable...that the
cut-off points [for most common, for very infrequent, words] will have to be adjusted
to the material we actually use. The effect on the process of such factors as style,
size of text, the complexity of the subject matter, and the like, is as yet not clearly
seen. The collection of large amounts of text and their analysis will undoubtedly be
the best way of determining the effects of these variables."
Some of the reasons for keeping stop lists short, however, may reflect unnecessary
programming difficulties. Turner and Kennedy have reported that in the SAPIR system a
title word is compared only with the group of nonsignificant words that have the same
number of characters, in order to reduce the machine time required for the exclusion
list search. 1/ Skaggs and Spangler give an account of an exclusion list system developed
for general text processing as follows:

"A  representative  form  developed  by  General  Electric  is  composed  of  three  groups 
of words, high frequency, special and standard. The high frequency words (25)
occur  most  frequently  in  English  text.    A  compression  of  approximately  35  percent 
will  occur  for  most  kinds  of  text  when  these  25  words  are  deleted.     The  special 
words  are  derived  from  the  particular  body  of  text  being  processed.     The  com- 
position of  this  group  is  left  to  the  program  user.    Normally  the  words  for  this 
group   are  selected  by  making  an  Editing  list  in  alphabetical  sequence.    The  words 
appearing  in  the  index  position  on  the  preliminary  listing  are  then  reviewed. 

"Standard  words  are  words  that  occur  with  a  relatively  high  frequency  in  most 
types  of  text  and  therefore  are  appropriate  for  a  general  purpose  screen.    In  the 
GE  program,   375  words  are  used  in  this  group. 

"To  minimize  computer  processing  time,  it  is  desirable  that  words  in  the  Ex- 
clusion Dictionary  be  arranged  in  approximate  order  of  their  frequency  of 
occurrence." 2/

It  should  be  noted,  however,  that  in  most  cases  stop  list  searches  can  be  programmed  in 
the  form  of  so-called  "logarithmic",  "partitioning"  or  "bifurcation"  searches  in  which 
the  number  of  machine  operations  required  is  only  log2  N  +  1,  where  N  is  the  number  of 
words  in  the  list. 
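The bound on the number of operations can be checked empirically. The sketch below implements an ordinary bisection search over an alphabetized list and counts the probes; the function name `binary_search_probes` and the synthetic 1,000-word list are assumptions for illustration. For a list of N words the worst case is floor(log2 N) + 1 probes, which is 10 when N is 1,000.

```python
import math


def binary_search_probes(sorted_words, target):
    """Bisection search of an alphabetized list; returns (found, number_of_probes)."""
    lo, hi = 0, len(sorted_words) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_words[mid] == target:
            return True, probes
        if sorted_words[mid] < target:
            lo = mid + 1   # target lies in the upper half
        else:
            hi = mid - 1   # target lies in the lower half
    return False, probes


# A synthetic, already alphabetized stop list of 1,000 entries.
stop_list = sorted(f"w{i:04d}" for i in range(1000))

worst = max(binary_search_probes(stop_list, w)[1] for w in stop_list)
bound = math.floor(math.log2(len(stop_list))) + 1
print(worst, bound)  # prints: 10 10
```

Doubling the list to 2,000 words adds only one probe to the worst case, so machine time for the exclusion search is a weak argument for keeping the list short.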

The more words excluded, the fewer the title entry lines that must be included in
the final index. This is a factor involving first of all the user in the sequential scanning
he must do, where, as Coates has remarked, the retrieval effectiveness is usually in
inverse proportion to the amount of such scanning required. 3/ Secondly, longer stop lists
help to minimize the long block problem, since it is obviously the most frequently
occurring title words that have not been excluded that cause the longest blocks of entries.


Turner  and  Kennedy,   1961  [614],  p.  7. 
Skaggs and Spangler, 1963 [557], p. 29.
Coates,  1962  [134],  p.  430. 
The important economic factor, however, is the total number of lines to be printed in the
index,  which  is  directly  reflected  in  page  costs.     The  effects  of  page   costs,   in  turn, 
engender  compromises  in  printing  quality,   such  as  page  format  and  size  of  type.  These 
are  among  the  serious  unresolved  problems  that  affect  user  acceptance  of  KWIC  indexes 
and  involve  questions  of  format,  legibility,   character  sets,   and  size  of  the  index. 

In general, however, in the present state of the art of KWIC indexing, the consensus
seems to be that of qualified praise, especially for the early announcement and
dissemination applications. The KWIC index is recognized as responding to a definite need, 1/
as having merit for fields in which more conventional indexes do not exist as well as for
current awareness searching, 2/ as receiving excellent response from users "because
they can take a handy booklet, sit down at a table and look under the words they know and
use, and which they expect other engineers to use in titles." 3/ Bernier and Crane, after
considering comparative effectiveness data for subject as against word indexing, come to
the following conclusions:

"Title  lists  keyed  by  words  have  value  for  quick  distribution  and  fast  use  since  time 
is  often  a  very  important  element  in  the  obtaining  of  information.     Such  lists  do  not 
serve  adequately  for  thorough  searching.   ...  A  title  concordance  may  be  more  use- 
ful than  would  seem  from  the  .  .  .   data  on  index  entries.    However,   it  must  obviously 
be  incomplete,  must  have  many  unnecessary  entries,   and  would  not  prove  suggestive 
enough to users who lack background in the subjects sought." 4/

Additional benefits can quite readily be obtained by taking advantage of the
bibliographic information once it is in machine-readable form to provide selective KWIC
indexes (Balz and Stanwood, 1963 [28]; Black, 1962 [65]; Carroll and Summit, 1962 [102]),
machine retrieval of item citations by specified keywords (Kennedy, 1961 [311]), and
selections of items geared to a Selective Dissemination of Information System (Barnes and
Resnick, 1963 [36]; Balz and Stanwood, 1963 [28]). Gallianza and Kennedy at the
Lawrence Radiation Laboratory, for example, report as being under development
programs for the IBM 1401 and 7090 computers which will combine KWIC type indexing
features with the logical search operators "AND", "OR", and "IF" in order that users
may specify subject searches in ordinary English language terms. 5/
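Once the titles are in machine-readable form, keyword retrieval with "AND" and "OR" reduces to intersections and unions over an inverted keyword file. The sketch below illustrates only that general idea; it is not the Lawrence Radiation Laboratory program, whose design is not described here, and the "IF" operator is omitted because its intended meaning is not stated. All function names and sample titles are hypothetical.

```python
def build_keyword_index(titles, stop_words):
    """Inverted file: keyword -> set of item numbers whose titles contain it."""
    index = {}
    for doc_id, title in enumerate(titles):
        for word in title.lower().split():
            if word not in stop_words:
                index.setdefault(word, set()).add(doc_id)
    return index


def search_and(index, *terms):
    """Items whose titles contain every one of the terms."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Items whose titles contain at least one of the terms."""
    return set().union(*(index.get(t, set()) for t in terms))


titles = [
    "automatic indexing of legal literature",
    "legal aspects of patents",
    "automatic abstracting of chemical literature",
]
index = build_keyword_index(titles, stop_words={"of"})

print(sorted(search_and(index, "automatic", "literature")))  # prints: [0, 2]
print(sorted(search_or(index, "legal", "chemical")))         # prints: [0, 1, 2]
```

The same inverted file serves a selective dissemination system as well: a user's interest profile is simply a stored query run against each new batch of titles.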

Clapp, 1963 [122], p. 7.

2/ 

Markus,   1962  [  394]  ,  p.  19. 

Black,   1962  [ 65] ,  p. 316. 

Bernier  and  Crane,   1962  [56],  p.  120. 

5/ 

National Science Foundation's CR&D Report No. 11, [430], p. 42.
3.2 Modified Derivative Indexing

Some  of  the  more  obvious  of  the  disadvantages  of  KWIC  indexing  techniques  can  be 
reduced  if  not  eliminated  by  a  variety  of  human  and  machine  procedures.    These  include 
augmentation of titles to provide additional clues to subject aspects, manual post-editing,
and  synonym  reduction  through  such  devices  as  thesaurus  lookups. 

The  ink  was  scarcely  dry  on  the  first  issues  of  a  KWIC  index  before  a  number  of 
suggestions  for  improvements,  modifications,  and  augmentations  were  proffered  in  the 
literature.    In  fact,  both  Luhn  and  Baxendale  considered  various  possible  refinements  in 
their original proposals. The first systematic review of work in the field of automatic
extracting--whether to produce indexes or abstracts, or both--was made by Edmundson
and Wyllys in 1961 [181]. They covered not only the KWIC type indexes as such, but also
modifications suggested by Baxendale, Luhn, Oswald and others, and they themselves
advanced a number of additional possibilities. Of the various modifications and
refinements that have been suggested, the most obvious is that of title augmentation.

3.2.1  Title  Augmentation 

The machine-prepared index that was probably the first to go into productive
operation is actually one involving title and subject indicators rather than pure
keyword-from-title permutations. The CIA project, beginning in 1952, is based upon manual
pre-editing of the titles themselves, with the words to be picked up as index entries being
underlined. In addition, it involves assignment of other words, descriptors or terms
from a hierarchical classification schedule to indicate additional access points (Veilleux,
1961 [624]).

In later KWIC type indexing, the possibilities of improving effectiveness by
pre-editing or post-editing to modify and expand titles have been suggested and explored by a
number of investigators. The semi-automatic indexing reported by Janaske adds
descriptive words or phrases in parentheses at the end of titles and uses them as
additional indexing points (Janaske, 1962 [299]). At Biological Abstracts Service,
improvements have been obtained (without sacrifice in the speed desired in order to index
5,000 abstracts twice a month) by title supplementation as well as by an improved stop
list and by post-editing word divisions and word recombinations. 1/ Titles for each of
two 12,000-item bibliographies in the field of radiobiology are reported as being edited
considerably before KWIC type processing. 2/ Other examples of modified derivative
indexing based on title augmentation include Chemical Patents, 3/ the Applied Physics
Letters indexing project at Oak Ridge National Laboratory, which provides for an
author-prepared form to describe features of property and method not covered in the title, 4/
and the KWIC Index to Neurochemistry ([420]).
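Augmentation by parenthesized additions, as in the procedure reported by Janaske, can be sketched as follows. This is an illustration of the general technique only: the function names `augment_title` and `index_words`, the punctuation handling, and the editor-supplied terms are assumptions, applied here to the "pathological" title cited earlier.

```python
def augment_title(title, extra_terms):
    """Append editor-supplied descriptive terms in parentheses at the end of the title."""
    if not extra_terms:
        return title
    return f"{title} ({', '.join(extra_terms)})"


def index_words(augmented_title, stop_words):
    """Indexing points drawn from the title and the parenthesized additions alike."""
    cleaned = augmented_title.lower()
    for ch in "(),":
        cleaned = cleaned.replace(ch, " ")
    return sorted({w for w in cleaned.split() if w not in stop_words})


# Hypothetical editor additions for a non-informative title.
augmented = augment_title("The Golden Retriever", ["information", "retrieval"])
print(augmented)
# prints: The Golden Retriever (information, retrieval)
print(index_words(augmented, stop_words={"the"}))
# prints: ['golden', 'information', 'retrieval', 'retriever']
```

Because the added terms travel with the title line, they appear in the printed context as well as in the index position, so the user sees at once why the item was filed there.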


Parkins,  1963,  [466],  p.  27. 
Davis, 1963 [150], p. 238.

See  Markus,  1962  [394],  p.  19,  and  ref.  [662]. 
Connolly,    1963  [  136],  p.  35. 
To some extent, however, the use of human editors to improve the product of KWIC
type  indexing  defeats  the  initial  purpose  of  a  quick  and  purely  clerical  or  mechanical 
process.    Thus,  Dowell  and  Marshall  argue: 

".  .  .  The  basic  permuted-title  index  can  be  substantially  improved  by  editing  and  re- 
writing the  titles  before  they  are  submitted  to  the  computer.   .  .  .  But  this  of  course, 
destroys the great advantage claimed for the permuted title index, 'that it is a
purely clerical process'. Intellectual effort has entered the picture again and we
are back where we started." 1/

In the extreme case, the re-introduction of intellectual effort is in effect the
re-introduction of conventional human indexing, with the machine's role limited to that
of compilation, as in the case of the "notation-of-content" statements prepared for
NASA's STAR System (Slamecka and Zunde, 1963 [561]; Newbaker and Savage, 1963 [430]).

Kennedy suggests instead, therefore, that the augmentation might be accomplished by
the authors themselves. However, it may then be pointed out, as by Bernier and Crane,
for example, that the supplementation of titles before publication in order to provide
suitable additional indexing words would be "awkward, space-consuming and difficult".
They continue:

"It would call for the attention of index experts at the manuscript stage, which would
delay publication and expand the total indexing effort. Furthermore, good, thorough
indexes are based on the full information of abstracts and papers, not on their titles
only." 2/

An alternative method for title augmentation to improve the quality of KWIC indexing
is therefore to establish procedures for machine selection of significant words from more
of the text than just the titles alone. In fact, Luhn himself did not limit his technique as
originally proposed to titles only but indicated that the process could be performed at
various levels: title, abstract, or full text. 3/ In the 1958 permuted index to the ICSI
preprints, entries were derived from titles, authors' names, author affiliations, headings
within the paper, figure and table captions, and sentences and phrases taken directly
from text. 4/ Combinations of human and machine procedures based on sentences and
phrases selected from text are described by Herner, who cites a two-fold advantage:
"First, it is not wholly dependent on the informativeness or lack of informativeness of
titles and bibliographic citations, and, second, it affords a greater depth of analysis than
is generally possible where titles or bibliographic descriptions alone are used." 5/


1/ Dowell and Marshall, 1962 [159], pp. 324-325.
2/ Bernier and Crane, 1962 [56], p. 117.
3/ Luhn, 1959 [381], p. 289.
4/ Citron, et al, 1958 [120], p. i.
5/ Herner, 1963 [264], pp. 1-2.

Taking more text as the basis for automatic derivative indexing adds, of course, the
problems and costs of keystroking additional input material. At the same time, most of
the major problems of scatter of references, synonymity, redundancy and exclusive
reliance on the author's own language and terminology not only remain but may quite
probably be intensified. The problems of establishing suitable rules for selection of
significant words are aggravated, not only by the far larger number of different words to
be processed, but because of unresolved problems in effectively relating length of index
and depth of indexing to the length of the document. 1/

There are, however, a number of practical suggestions by which machine augmentation
of titles might be accomplished. First is the invariant selection of words that are
capitalized, other than those that begin a sentence. 2/ As Wyllys points out, this type of
selection criterion would emphasize proper names, and these in turn might be particularly
valuable clues, especially in a military intelligence situation. 3/ It has also been
suggested that the selection criteria should depend on particular pre-specified contexts,
such as being preceded by the words: "the results were . . .", "in conclusion . . .", and
the like.
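
As a rough illustration of the capitalization criterion just described, the following
Python sketch selects capitalized words other than those that begin a sentence. The
function name, the sample text, and the crude sentence-splitting rule are assumptions
made for this example and are not taken from the work cited.

```python
import re

def capitalized_candidates(text):
    """Return capitalized words that do not begin a sentence, in order of first appearance."""
    candidates = []
    # Crude sentence split: break after ., ! or ? followed by whitespace (an assumption).
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = re.findall(r"[A-Za-z][A-Za-z'-]*", sentence)
        for word in words[1:]:  # skip the sentence-initial word
            if word[0].isupper() and word not in candidates:
                candidates.append(word)
    return candidates

sample = "The index was compiled at Oak Ridge. Results were sent to Luhn."
print(capitalized_candidates(sample))
```

A rule of this kind would miss a proper name that happens to open a sentence, which is
one reason it is likely to be combined with other selection devices.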

A  second  type  of  machine  selection  procedure  is  the  converse  of  the  exclusion  or 
stop  list,  namely,  an  inclusion  list  or  dictionary  which  may  involve  especially  significant 
words  for  a  particular  subject  matter  area  or  words  that  are  of  importance  to  a  particular 
organization.    In  the  discussions  of  the  Area  5  ICSI  papers  it  was  remarked: 

"Another  complication  is  that  mechanized  indexing  finds  in  a  paper  what  was 
important  to  the  author.    What  happens  if  there  is  something  in  the  paper  not 
important  to  the  author  but  of  importance  to  the  indexer?    One  possibility  is 
to  have  a  list  of  words  and  phrases  expressing  the  interests  of  a  particular 
collection,  which  the  machine  looks  for  in  the  papers.    If  this  word  or  phrase 
occurs even once, it should be picked up as an indexing term." 4/
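
The inclusion-list idea can be sketched alongside the ordinary stop list as follows. Both
word lists below are hypothetical placeholders, as is the function name; the only point
illustrated is that a listed word or phrase is picked up if it occurs even once, while
all other words are merely filtered against the stop list.

```python
import re

# Hypothetical lists for illustration only.
STOP_LIST = {"a", "and", "for", "is", "of", "the", "to"}
INCLUSION_LIST = {"halogen", "permuted index", "retrieval"}  # may hold phrases

def select_terms(text):
    """Return (forced, derived): inclusion-list hits and stop-list survivors."""
    normalized = " ".join(re.findall(r"[a-z]+", text.lower()))
    # A single occurrence of a listed word or phrase is enough to select it.
    forced = sorted(
        term for term in INCLUSION_LIST
        if re.search(r"\b" + re.escape(term) + r"\b", normalized)
    )
    # Everything not on the stop list remains an ordinary derived candidate.
    derived = sorted({w for w in normalized.split() if w not in STOP_LIST})
    return forced, derived

print(select_terms("A permuted index is useful for retrieval of the literature."))
```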


1/ See, for example, Wyllys, 1963 [653], p. 22.
2/ See Luhn, 1959 [371], p. 52; [384], p. 8.
3/ Wyllys, 1963 [653], p. 15.
4/ See Ref. [578], p. 1263. See also, among others, Luhn, 1959 [371], p. 52: "Just as
common words have been eliminated by look-up in a special index, certain essential
words may be looked up in another special index for the purpose of listing them under
any circumstances".

This approach to the selection problem can be combined with other devices, as in the
"Selective Dissemination" system described by Kraft in which keyword extraction indexing
is applied to abstract, title, author's name and manually assigned index terms, after
processing of all input material against both "in" and "out" dictionary lists. 1/

The use of abstracts rather than full text as source material makes the selection
criteria problems somewhat less severe. In addition, there is evidence to suggest that
the abstract does contain much of the significant information that would normally be
indexed and the text of the abstract is therefore a fertile field for title augmentation. In
experiments conducted by Slamecka and Zunde on the comparison of indexing terms
manually assigned with the occurrences of the names of these terms in abstracts used in
NASA's STAR system, it was found that 80.4 percent of the assigned terms were contained
in the abstracts. 2/ Swanson, on the other hand, suggests that, at least for short articles
having homogeneous subject matter, title and first paragraph "are nearly as good as full
text." 3/

A combination inclusion-exclusion list system may involve prior "weighting for
relevance" of words that are judged by human analysts to be significant for purposes of
search and retrieval, as suggested by Swanson, for example:

"The computer first separates those words which are important for purposes of
information retrieval from those which are unimportant. This is accomplished by
means of looking up each word in an alphabetized word list with which the computer
is furnished. Each word in this word list carries a 'weight' which reflects an
estimate of its importance for retrieval purposes. Words of zero weight are
completely unimportant and discarded by the computer for indexing entries." 4/
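
A minimal sketch of such a weighted word list is given below. The weights are invented
for illustration, and the treatment of words absent from the list (here given zero
weight and discarded) is an assumption, since the passage quoted does not say how
unlisted words are handled.

```python
# Hypothetical weighted word list; the weights are illustrative only.
WORD_WEIGHTS = {"retrieval": 3, "indexing": 3, "computer": 1, "the": 0, "of": 0}

def weighted_entries(words, weights=WORD_WEIGHTS, default=0):
    """Keep words of nonzero weight, heaviest first (ties in alphabetical order)."""
    kept = {}
    for word in words:
        weight = weights.get(word.lower(), default)  # unlisted words: assumed weight 0
        if weight > 0:
            kept[word.lower()] = weight
    return sorted(kept.items(), key=lambda item: (-item[1], item[0]))

print(weighted_entries("The computer indexing of retrieval".split()))
```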

Continuing work at Thompson Ramo-Wooldridge on automatic indexing methods includes
further investigation of assignments of relevance weight estimates to words and phrases
(1959 [490] and [491], 1963 [602]).

1/ Kraft, 1963 [334], pp. 69-70.
2/ Slamecka and Zunde, 1963 [561]. In addition they report (p. 139) that a large number
of the terms not found were "either broad, general terms (i.e., 'device') or generic
level concepts of terms contained in the abstracts."
3/ Swanson, 1963 [580], p. 1.

3.2.2 Book Indexing By Computer

For internal indexing, that is, the subject indexing of the contents of a single book or
report, automatic indexing experiments are usually directed toward the processing of
full text, with use of stop lists of various lengths. The work of Artandi for her doctorate
at Rutgers in indexing of a book by computer programs (1963 [20] and [22]) is an example
of such modified derivative indexing. Specifically, Artandi's method involves:

(1)  Establishment  of  a  list  of  key  terms  appropriate  to  a  given  subject 
area  to  be  used  as  an  inclusion  list  for  word  extractions  from  text. 

(2) Application of an appropriate syndetic apparatus to be used in the
compilation and ordering of the index entries.

(3) Means for the automatic selection of index entries other than those
on the pre-specified inclusion list, especially for the selection of
proper names.

The  text  used  by  Artandi  for  her  study  consisted  of  a  59-page  chapter  on  halogens 
from  J.  W.  Mellor's  Modern  Inorganic  Chemistry.     This  text  was  keypunched  with 
special  tags  being  assigned  to  indicate  the  page  numbers  and  the  incidence  of  capitalized 
words  in  the  text.    Text  words  greater  than  three  characters  in  length  were  first  checked 
against  the  inclusion  dictionary  of  "detection  terms".     There  was,  in  addition,  an 
"expression  term"  dictionary  which  constituted  the  vocabulary  of  the  final  index  and  in 
which  a  given  expression  term  might  or  might  not  be  identical  with  the  corresponding 
detection term. Cross-references were supplied by a program routine which checked the
index term list against a list of expression terms with their detection terms grouped
under them and which compiled cross-reference entries, one for each detection term
associated with an expression term appearing on the index list.
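
The interplay of detection terms, expression terms, and cross-references described above
might be sketched as follows. The dictionary contents, the punctuation handling, and the
function name are hypothetical; only the general flow (look up text words longer than
three characters, post page references under the expression term, then emit one
cross-reference per associated detection term) follows the description.

```python
from collections import defaultdict

# Hypothetical dictionary: detection term (as found in text) -> expression term (as printed).
DETECTION_TO_EXPRESSION = {
    "chlorine": "Chlorine",
    "muriatic": "Hydrochloric acid",
    "hydrochloric": "Hydrochloric acid",
}

def build_index(pages):
    """pages maps page number -> text. Returns (entries, cross_references)."""
    entries = defaultdict(set)  # expression term -> set of page numbers
    for page, text in pages.items():
        for raw in text.lower().split():
            word = raw.strip(".,;:()")
            # Only words longer than three characters are checked against the dictionary.
            if len(word) > 3 and word in DETECTION_TO_EXPRESSION:
                entries[DETECTION_TO_EXPRESSION[word]].add(page)
    # One cross-reference for each detection term grouped under an indexed expression term.
    cross_references = sorted(
        (detection, expression)
        for detection, expression in DETECTION_TO_EXPRESSION.items()
        if expression in entries and detection != expression.lower()
    )
    return {term: sorted(found) for term, found in entries.items()}, cross_references

pages = {1: "Chlorine reacts readily.",
         2: "Muriatic acid, or hydrochloric acid, contains chlorine."}
index, see_refs = build_index(pages)
print(index)
print(see_refs)
```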

For her experimental corpus, Artandi's program developed 363 page references,
138 different index entries and 35 cross-references. She compared these results with
those obtainable by conventional human indexing with respect to the factors of heading
density (ratio of number of entries to number of words in the book), entry density (ratio
of the number of page references to the number of pages), and distribution (ratios of
entries for chemical compounds, proper names, and subject entries to the total number
of entries). No indexing errors were found in the computer-generated index for a 5
percent random sample of the pages of the corpus, but five omissions were found in the
machine indexing of these sample pages. Artandi concluded, however, that although the
quality of indexing appeared favorable, the costs, which approximated $1.50 per page
indexed, were impractically high.

Book indexing by computer has also been investigated by Maloney, Dukes, and Green
at the Army Biological Laboratories, Fort Detrick, Maryland. 1/ Input is based on the
by-product paper tape generated when the manuscript is typed on a tape typewriter. The
paper tape is in turn converted to punched cards which are then processed by a UNIVAC
SS-90 II computer in an editing run that deletes unrecognizable codes and then stores page,
line, sentence number and other reference identifications. After re-processing against
a stop list of common words, all other words in the edited text are selected as
candidate index entries; these are then sorted into alphabetical order with subsequent
printout giving each word occurrence followed by the entire sentence which contained it
and the page and other location identifications. This computer output is then post-edited
manually not only to eliminate trivial entries but also to normalize terms and phrases
used.

1/ C. J. Maloney, private communication. A report by C. J. Maloney, J. Dukes, and
S. Green, "Indexing reports by computer" is in process of preparation for
publication.
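
The candidate-entry listing produced by the Fort Detrick procedure can be approximated
by the short Python sketch below. The stop list, the sentence-splitting rule, and the
record layout are assumptions for illustration; the original programs ran on punched
cards and are not reproduced here.

```python
import re

STOP_LIST = {"the", "was", "were", "a", "of", "and"}  # hypothetical stop list

def candidate_entries(pages):
    """pages maps page number -> text. Returns sorted (word, page, sentence_no, sentence)."""
    rows = []
    for page, text in pages.items():
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        for number, sentence in enumerate(sentences, start=1):
            for word in re.findall(r"[a-z]+", sentence.lower()):
                if word not in STOP_LIST:
                    # Each occurrence carries its full sentence and location identifiers.
                    rows.append((word, page, number, sentence))
    return sorted(rows)  # alphabetical by word, then by location

for row in candidate_entries({1: "The culture was grown. Spores were counted."}):
    print(row)
```

The listing is deliberately exhaustive; trivial entries and variant word forms would
still have to be handled in the manual post-editing step.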

3.2.3 Modified Derivative Indexing - Baxendale's Experiments

As has been previously noted in the introduction to this report, the name of Phyllis
Baxendale together with that of H. P. Luhn is generally accorded credit for pioneering
efforts in the entire area of automatic indexing. Baxendale in particular is generally
credited with the first actual experiments in modified derivative indexing. In
investigations beginning in the late 1950's, she has explored not only statistical
approaches to automatic selection of index terms (based for example on word frequencies)
but also the use of word pairs, word groups, contextual associations, and in particular
the subject-indicating clues of prepositional phrases (Baxendale, 1958 [41], 1961 [40],
1962 [42]; Becker, 1960 [44]; Edmundson and Wyllys, 1961 [181]).

Baxendale began by considering the patterns of scanning that humans typically use
to select "topic" sentences, phrases and words, and she then proceeded to simulate by
computer program the selection of phrases consisting primarily of nouns and modifiers.
In her first experiments (1958 [41]) she used two methods of automatic selection. In
the first procedure, words serving the grammatical functions of pronoun, article,
auxiliary verb, conjunction and the like were deleted by stop list lookup. Frequency
count statistics were then derived for the remaining words. In her second procedure,
the computer was programmed to select prepositional phrases from text and to use the
four words succeeding the preposition as index entries unless an additional preposition or
a punctuation mark was first encountered.
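
The second procedure can be illustrated with a small Python sketch. The preposition
list is a hypothetical short sample, and the decision to treat digits and symbols like
punctuation is an assumption; the four-word limit and the early stop on a further
preposition or punctuation mark follow the description above.

```python
import re

PREPOSITIONS = {"of", "in", "on", "for", "by", "with", "to", "from", "at"}  # hypothetical

def phrases_after_prepositions(text, max_words=4):
    """Collect up to max_words words following each preposition."""
    # Tokens are alphabetic words or single non-letter symbols (punctuation, digits).
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    phrases = []
    for i, token in enumerate(tokens):
        if token.lower() not in PREPOSITIONS:
            continue
        phrase = []
        for following in tokens[i + 1:]:
            # Stop at the word limit, at punctuation, or at another preposition.
            if (len(phrase) == max_words or not following.isalpha()
                    or following.lower() in PREPOSITIONS):
                break
            phrase.append(following.lower())
        if phrase:
            phrases.append(" ".join(phrase))
    return phrases

print(phrases_after_prepositions("Selection of index terms by computer, with stop lists."))
```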

In later experiments, Baxendale has explored possible grammatical models "which
would select all and only nouns or adjective-noun combinations". 1/ Taking as an initial
corpus a sample of document titles, rules were devised to reject for human analysis titles
with question-marks and the like, to eliminate numeric information and single symbols,
and to segment the title into its component clauses and phrases by the detection of
commas, periods, and similar clues. By list lookup, certain words are identified as
capable of serving the syntactic functions of being quantifiers, prepositions, or clause
introducers. Special subscripts are then assigned to these words and the subscripts are
examined by machine to provide further segmentation; to delete quantifiers, auxiliary
verbs, or words ending in "ed" or "ing" and preceded by an auxiliary verb; and to
determine relationship functions between the remaining, presumably substantive, words.

Still other work by Baxendale has been directed toward the development of frequency
of co-occurrence or textual association of candidate indexing terms. She reports as
follows:

"[In the frequency matrix] . . . the diagonal elements . . . give the total frequency of an
index term and the off-diagonal gives the frequency of co-occurrence of two terms.
The diagonal of the 'context' matrix represents that portion of the total vocabulary
with which an individual term has been coordinated, and the off-diagonal the extent
to which two terms have common context. . . Such matrices give a basis for examining
the extent to which terms are generic or specific within the context of the collection
of documents. One can speculate that terms occurring with high frequency and wide
context, i.e., with frequencies distributed amongst all or nearly all off-diagonal
elements of the matrix are of such broad connotation as to be indifferent
discriminators of content . . . The frequency and context matrices can again be used to
determine the modifiers with which they can most meaningfully be coupled for the
collection of documents being considered." 2/
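
A frequency matrix of the kind quoted can be sketched in Python as follows. It is
assumed here, for illustration only, that "frequency" means the number of documents in
which a term (or a pair of terms) occurs; the function names and the simple breadth
measure are likewise hypothetical and are not taken from Baxendale's reports.

```python
from itertools import combinations

def frequency_matrix(documents):
    """documents: list of term lists. Returns (terms, matrix) of document counts."""
    terms = sorted(set().union(*documents))
    position = {term: i for i, term in enumerate(terms)}
    size = len(terms)
    matrix = [[0] * size for _ in range(size)]
    for document in documents:
        present = sorted(set(document))
        for term in present:
            matrix[position[term]][position[term]] += 1  # diagonal: total frequency
        for a, b in combinations(present, 2):
            matrix[position[a]][position[b]] += 1        # off-diagonal: co-occurrence
            matrix[position[b]][position[a]] += 1
    return terms, matrix

def context_breadth(matrix, i):
    """Fraction of the other terms with which term i co-occurs at least once."""
    others = [matrix[i][j] for j in range(len(matrix)) if j != i]
    return sum(1 for value in others if value > 0) / len(others) if others else 0.0

docs = [["retrieval", "searching"], ["retrieval", "machine"], ["retrieval", "metal"]]
terms, matrix = frequency_matrix(docs)
print(terms)
print(matrix[terms.index("retrieval")], context_breadth(matrix, terms.index("retrieval")))
```

In this toy collection "retrieval" co-occurs with every other term, the pattern the
quotation associates with terms of broad connotation.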

Finally, Baxendale notes that on the basis of her studies it should be possible to
select quasi-subject headings based on frequency counting criteria, but then to order the
remaining vocabulary of selected terms according to contextual measures of association
which are semantic, syntactic, or statistical in nature. Experimental results for a
collection of 1,500 documents included semantic associations between "searching" and
"retrieval", syntactic associations of "machine" or "literature" with "retrieval", and
the apparently misleading association of "metal" with "retrieval" which, however, had
statistical significance within the particular document sample. 3/

Other investigators who have explored noun-adjective clues for selection include
Anger, Chonez, Langleben and Shumilina, and Swanson. Anger looked for relationships
indicated by syntactic dependencies or by noun-adjective and adjective-adverb linkages,
and gave in an appendix a suggested program for phrase inversions. 4/ Chonez has
described a computer program which, by recognizing "separating" words, especially
prepositions, and applying "pseudo-grammatical" rules, compiles an index to English
language items in the fields of ionized gas physics and thermonuclear fusion. It is
claimed that:

"The subject index thus prepared is similar in presentation to Luhn's KWIC indexes,
but is fundamentally different in conception and is in fact intermediate between . . .
(this) . . . and the conventional alphabetic subject indexes." 5/

Langleben and Shumilina are concerned with machine-aided procedures for translation
from natural language materials to an intermediary or documentation language.
They indicate, for example, that the preposition "from" serves as a key for the treatment
of two nouns connected by it. 1/ Swanson, describing research project progress at
Ramo-Wooldridge as of 1960, reported to the National Symposium on Machine Translation
with respect to multiple meaning problems as follows:

"We are also investigating the possibility of discovering semantic attributes of
words based upon certain automatically recognizable statistical features of the
context. Our initial endeavor in this direction has been to attempt to discover
a classification system for nouns based upon their frequency spectrum of categories
of modifying adjectives, these categories being automatically recognizable." 2/

3.3 Derivative Indexing From Automatic Abstracting Techniques

While Baxendale's work has had certain points in common with automatic abstracting
or extracting processes, particularly in the use of word frequency statistics and the
consideration of possibilities for first selecting topic sentences, her major interests in
this area have been in automatic indexing as such, rather than in machine selection of
sentences from text to serve as an automatic extract or derivative abstract of the
document. Much of the machine processing to date of full text for documentation
purposes, however, has had the latter goal as the principal research objective.

As we have previously noted, the subject of automatic abstracting or
auto-condensation is not in itself a primary concern of this survey. Nevertheless, the
significant words occurring in the abstract of a document, whether generated by man or by
machine, are obviously good candidates for indexing terms. Moreover, it has been
strongly suggested that the questions of using positional, editorial, and syntactical clues
in order to improve automatic indexing techniques will profit by research that is being
done in both automatic extracting procedures and in other types of linguistic data
processing based upon full text. 3/

3.3.1 Auto-Condensation and Auto-Encoding Techniques of H. P. Luhn

Although Luhn's work in the field of documentation aided by machine has had its best
known and most popular acceptance with respect to the KWIC index proper, even more
provocative possibilities lie in the development of some of the auto-condensation and
auto-encoding techniques which he also proposed, especially for full text processing. In
this area, although he himself has also suggested a variety of possible improvements and
refinements, the actual experimental work by him and by his associates has mostly
been done on the basis of word frequency statistics.

2/ Swanson, 1961 [585], pp. 391-392.
3/ See, for example, Wyllys, 1963 [653], p. 7.

Considering first the most frequently occurring words in a given text as too common
to be subject-indicative (those usually stopped or purged by a suitable exclusion dictionary
or stop list, for example) and next the least frequent words as being rarely topical in a
content-revealing sense, Luhn settles upon a middle range of frequency of word
occurrence as the basis for his auto-condensation processes. The actual frequency counts
are computed, together with indications of page, line, and occurrence within the same
sentence. When this has been done for the complete text, each individual sentence is then
checked for the "score" of relatively high frequency words occurring in it, and sentences
with the highest scores are then automatically selected, in textually-occurring order, and
are printed out as an abstract, more properly an extract, of the document.
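
The procedure just outlined can be illustrated by the following simplified Python
sketch. The stop list, the lower frequency cutoff, and the scoring rule (a plain count
of significant words per sentence) are assumptions chosen for brevity; the sketch is not
a reproduction of Luhn's own scoring formula.

```python
import re
from collections import Counter

STOP_LIST = {"a", "by", "had", "of", "the", "was", "we"}  # hypothetical exclusion list

def auto_extract(text, n_sentences=2, min_count=2):
    """Select the n_sentences highest-scoring sentences, returned in text order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]
    counts = Counter(w for words in tokenized for w in words if w not in STOP_LIST)
    # Middle range: stop words excluded above, rare words dropped by min_count.
    significant = {w for w, c in counts.items() if c >= min_count}
    scores = [sum(1 for w in words if w in significant) for words in tokenized]
    ranked = sorted(range(len(sentences)), key=lambda i: (-scores[i], i))[:n_sentences]
    return [sentences[i] for i in sorted(ranked)]  # restore textual order

text = ("Indexing by machine uses word counts. The weather was fine. "
        "Word counts guide machine indexing. We had lunch.")
print(auto_extract(text))
```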

The automatic encoding of documents may be achieved either by taking the high
ranking words of the selected sentences or by selecting the highest ranking of the words
in the entire document as index entries. Luhn typically justifies these procedures as
follows:

"Of  various  automatic  procedures  for  deriving  typical  patterns  for  characterizing 
documents,  the  systems  here  proposed  are  based  on  operations  involving 
statistical  properties  of  words  ...    It  is  held  that  the  more  often  a  certain  word 
appears  in  a  document  the  more  it  becomes  representative  of  the  subject  matter 
treated  by  the  author.    In  grading  words  in  accordance  with  the  frequency  of  usage 
within  a  document,  a  pattern  is  derived  which  is  typical  of  that  document  and  unique 
amongst  all  similarly  derived  patterns  of  a  collection  of  documents.    It  is  proposed 
that  the  more  similar  two  such  patterns  are  the  more  similar  is  the  intellectual 
contents  of  the  documents  they  represent.  .  . 

"...   The  creation  of  an  encoding  pattern  may  consist  of  listing  an  appropriate 
portion  of  the  words  ranking  highest  on  the  word  frequency  list  derived  from  a 
document.    Experiments  conducted  so  far  on  documents  ranging  in  size  from  500 
to  5000  words  have  indicated  that  word  patterns  consisting  of  from  ten  to  twenty- 
four  of  the  highest  ranking  words  furnish  adequate  discrimination  and  resolution 
for retrieval, sixteen such words being a likely average." 1/
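
An encoding pattern of this kind, and a comparison of two such patterns, might be
sketched as follows. The default pattern size of sixteen is taken from the quotation;
the use of set overlap (the Jaccard ratio) as the similarity measure is an assumption
made for this example, since the passage quoted does not specify how similarity of
patterns is to be computed.

```python
from collections import Counter

def encoding_pattern(words, stop_list, size=16):
    """Return the `size` most frequent non-stop words (ties in alphabetical order)."""
    counts = Counter(w for w in words if w not in stop_list)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:size]]

def pattern_similarity(pattern_a, pattern_b):
    """Assumed measure: shared words divided by all words in either pattern."""
    a, b = set(pattern_a), set(pattern_b)
    return len(a & b) / len(a | b) if a | b else 0.0

words = "index index index term term word".split()
print(encoding_pattern(words, stop_list=set(), size=2))
print(pattern_similarity(["index", "term"], ["term", "word"]))
```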

At Wright-Patterson Air Force Base an automated information selection and
retrieval system has been developed jointly by Air Force and IBM personnel
(Gallagher and Toomey, 1963 [205]). It involves both auto-indexing and
auto-abstracting techniques following the Luhn word-frequency-counting techniques.
Pre-editing is applied to demarcate fields (e.g., title, author) and to flag certain text
words, particularly proper names, for special treatment. Special treatment, over and
above the frequency-based selection score, is also given to words in the title field.

1/ Luhn, 1959 [371], p. 47.

On the abstracting side, modifications to the original Luhn formula involve
segmenting sentences in terms of strings of both high and low valued words separated
by either periods or continuous strings of low valued words, on the assumption that
long consecutive strings of low value words should weight negatively. The automatic
extract consists of the highest ranking 20 percent of the sentences subject to the
restriction that no fewer than 7 and no more than 20 sentences should be selected. On the
indexing side, the investigators report:

"As it is currently run, the auto-indexing program selects about one word in ten
as a keyword in articles of three thousand words or less. In articles longer than
three thousand words it tends to pick about one word in fifteen. This high incidence
of keywords naturally increases the amount of noise results returned by the query
program, although good search strategy cuts them down considerably." 1/
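
The extract-length restriction reported for this system (20 percent of the sentences,
but not fewer than 7 nor more than 20) reduces to a short calculation. In the sketch
below, rounding the 20 percent figure upward and capping the result at the number of
sentences actually available are both assumptions; the function name is hypothetical.

```python
def extract_length(n_sentences, percent=20, minimum=7, maximum=20):
    """Number of sentences to select for the automatic extract."""
    target = -(-n_sentences * percent // 100)   # ceiling division (rounding up is assumed)
    bounded = max(minimum, min(maximum, target))
    return min(n_sentences, bounded)            # cannot select more than exist

for n in (5, 10, 50, 200):
    print(n, extract_length(n))
```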

As of October 1963, the system was reported to be fully operative although not as
yet extensively tested in actual use. Gallagher and Toomey give illustrative auto-extract
results on two tested papers, one being Luhn's own "Automatic Creation of Literature
Abstracts". They give comparative results for manual versus machine selection of
keywords as index or search terms with 88.6 percent agreement, the human indexers having
selected, in 6 tests reported, 132 words and the machine method 117. Modifications
under consideration include pre-edit flagging of terms in author and cited-reference
fields for special weighting, setting the length of the abstract as a function of the total
number of words in an item, and, in the search program, generating additional search
terms by means of association factor techniques such as those suggested by Stiles.

To the basic approach of straightforward word frequency counting, Luhn himself
has suggested that improvements might be obtained from considering closely adjacent
words, 2/ word pairs, 3/ and reference to vocabularies specific to a given field. 4/
Other possibilities are capitalized words and lookup against an inclusion list. He also
suggests:

"If certain words could be given in their relationships to other words, more
specific meanings may be identified by such combinations. These relationships
may range from the mere co-occurrence of certain words within a phrase or
sentence to the combinations of specific parts of speech." 5/

Various investigators have proceeded to explore these and other possible
improvements, including incorporation of relative frequency information, use of
information about distances between high-ranked significant words, word pairs and word
n-tuples, and other devices to improve detection of significant clues to subject content.
Representative examples of such work will be discussed below. In addition, investigators
abroad have developed modifications to the basic Luhn word frequency approach which
appear to be necessary when it is applied to languages other than English. 1/

Thus, for example, Purto reports various investigations conducted by V. A. Argayev
and V. V. Borodin and by himself with respect to Russian language documents. 2/ Purto
notes first that the Luhn method as applied to Russian language materials selects
sentences which, while having the largest "significance coefficients", were not those most
essential to the meaning and further that: "an abstract in Russian made by Luhn's method
results in a choice of sentences not conveying basic information and not logically connected
with each other." 3/ The reasons for such failure he attributes to the fact that words with
different frequencies are considered equally important within a sentence for sentence
selection purposes and to the lack of consideration for semantic and grammatical
connectivity between significant words and between sentences. He then discusses several
methods for determining connectivity, such as the rule that the sentences most closely
connected with each other will be those in which the greatest number of the same
significant words occur. 4/
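
The connectivity rule mentioned last can be sketched very simply: two sentences are
scored by the number of significant words they share. The function names and the sample
data are hypothetical, and the sketch assumes that sentences have already been reduced
to lists of words and that the set of significant words is given.

```python
def connectivity(sentence_a, sentence_b, significant):
    """Number of distinct significant words common to both sentences."""
    return len(set(sentence_a) & set(sentence_b) & significant)

def most_connected_pair(sentences, significant):
    """Return (score, i, j) for the most closely connected pair, or None."""
    best = None
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            score = connectivity(sentences[i], sentences[j], significant)
            if best is None or score > best[0]:
                best = (score, i, j)
    return best

sentences = [["glass", "furnace", "temperature"],
             ["furnace", "temperature", "control"],
             ["market", "price"]]
significant = {"glass", "furnace", "temperature", "control"}
print(most_connected_pair(sentences, significant))
```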

A  somewhat  different  example  of  difficulties  occurring  when  the  basic  Luhn  technique 
is  applied  to  material  in  languages  other  than  English  is  given  by  Levery.    He  describes 
a  study  of  thirty  French  texts  concerned  with  the  development  and  manufacture  of  glass. 
He reports as follows:

"While we followed the classical idea that a relationship between the frequency of
a word and its significance exists, the fact that we worked with French texts forced
us to discount the value of frequency alone.

"French authors generally do not like to repeat the same words, and they vary their
vocabulary. . . It was necessary to combine the frequencies of words with the same
meanings or related to the same idea."

"A dictionary of synonyms was constructed. . . (and) different versions of the same
word had to be regrouped." 5/
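
The regrouping described by Levery amounts to counting concepts rather than word forms.
In the Python sketch below the synonym table is a small invented example (it is not
Levery's dictionary), and any word absent from the table is simply counted under its
own form.

```python
from collections import Counter

# Hypothetical table mapping variant forms and synonyms to one representative word.
SYNONYM_GROUPS = {
    "verre": "verre", "verres": "verre", "vitre": "verre",
    "four": "four", "fours": "four",
}

def concept_frequencies(words, groups=SYNONYM_GROUPS):
    """Combine the frequencies of words belonging to the same group."""
    return Counter(groups.get(word.lower(), word.lower()) for word in words)

freq = concept_frequencies(["verre", "vitre", "Verres", "four", "fusion"])
print(freq.most_common())
```

Each of the three glass-related forms occurs only once, so none would stand out on
frequency alone, whereas the combined concept has a count of three.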


1/ Note, however, that in the automatic abstracting program at Thompson Ramo-Wooldridge,
small-scale experiments suggest that automatic abstracting is as feasible for other
Indo-European languages as for English (1963 [603], p. ii). Also, at the Centre d'Etudes
Nucleaires, Saclay, automatic extraction experiments are being applied to texts both in
French and other languages; see National Science Foundation's CR&D report No. 6 [430], p. 20.

2/ Purto, 1962 [484]. He refers to a report, "The problem of automatic abstracting
and a means of solving it", by Argayev and Borodin, apparently available only as
a typescript dated 1959.

3/ Ibid, p. 3.

4/ Ibid, pp. 3-4.

5/ Levery, 1963 [359], p. 235.

3.3.2 Frequencies of Word n-tuples - Oswald and Others

The first alternative to the basic Luhn word frequency approach in automatic
abstracting techniques to be actively explored was apparently that of Oswald and his
associates (Oswald et al, 1959 [459]; Edmundson et al, 1959 [180]). Like Baxendale,
Oswald was interested in word pairs and word groups, particularly compound-noun and
adjective-noun compositions, as more revelatory of meaning than single words. Unlike
Baxendale, however, he was interested in the word group itself as selection criterion,
whereas she had used word group or phrase clues for the selection of (usually) single
indexing terms. Differences between their two approaches, both representing very early
efforts in the field, are summarized by Edmundson and Wyllys as follows:

"Oswald's  experiment  in  automatic  abstracting  differs  from  Luhn's  and  Baxendale's 
techniques  in  that  it  combines  the  notion  of  significance  as  a  function  of  word 
frequency  and  the  notion  of  significance  as  a  function  of  word  groupings,  by  employing 
juxtapositions  of  significant  words  as  the  basic  unit  for  measuring  the  importance 
of  a  sentence. . . 

"It  may  further  be  observed  that  Baxendale's  exhibited  indexes  are  made  up  of  single 
words  rather  than  word  groups,   in  spite  of  the  strong  case  she  makes  for  using 
groups. . . 

"Baxendale's  work  is  concerned  solely  with  the  automatic  construction  of  indexes; 
she  does  not  extend  her  treatment  of  word  significance  into  the  area  of  automatic 
abstracting." 1/

Oswald's "multiterms", however, were intended to overcome, in the areas of both
automatic indexing and automatic abstracting, at least some of the difficulty that concepts
are often expressed in compound nouns, word pairs, and longer groups of words consisting
of n-tuples of substantive words or of phrases. The result of considering both word
frequency and word-group frequency is that in Oswald's selection-groups it is usually the
case that only one word of the group has an individually high frequency but the
co-occurrence feature heightens the significance of the relatively lower frequency words
with which it appears. Thus, for automatic indexing, Oswald proposed significant word
groups as indexing terms, and his criteria for selection of sentences to be included in
machine-generated extracts are similarly based on the number of significant groups in
the sentences chosen.
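
As a rough illustration of the word-group idea, the following sketch counts juxtaposed pairs of content words in which at least one member is individually frequent, and then scores each sentence by the number of recurring pairs it contains. The stopword list, the two thresholds, and the whitespace tokenization are assumptions made for the example; they are not Oswald's actual parameters or procedure.

```python
from collections import Counter

def significant_pairs(sentences, stopwords, min_word_freq=2):
    """Count adjacent content-word pairs where at least one word is frequent."""
    tokenized = [[w.lower() for w in s.split()] for s in sentences]
    word_freq = Counter(
        w for toks in tokenized for w in toks if w not in stopwords
    )
    pair_freq = Counter()
    for toks in tokenized:
        for a, b in zip(toks, toks[1:]):
            if a in stopwords or b in stopwords:
                continue  # only juxtaposed content words form a group
            if max(word_freq[a], word_freq[b]) >= min_word_freq:
                pair_freq[(a, b)] += 1
    return pair_freq

def score_sentences(sentences, stopwords, min_pair_freq=2):
    """Score each sentence by the number of recurring significant pairs in it."""
    pairs = significant_pairs(sentences, stopwords)
    keep = {p for p, n in pairs.items() if n >= min_pair_freq}
    scores = []
    for s in sentences:
        toks = [w.lower() for w in s.split()]
        scores.append(sum(1 for p in zip(toks, toks[1:]) if p in keep))
    return scores
```

Sentences with the highest scores would be the candidates for a machine-generated extract, and the retained pairs themselves would be the candidate indexing terms.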

Other investigators who have stressed the importance of word pairs and longer groups
as necessary to reflect concepts include Bar-Hillel (1959 [33]), Black (1963 [64]), Clark
(1960 [123]), Doyle (1959 [165]), and Salton (1963 [519]). Doyle says succinctly that
"when a phrase, or some other aggregation of words, stands for a single idea, its
frequency in a document ought to interest us more than the frequencies of its component
words." 2/ Salton considers it desirable to use word groups rather than individual words


1/ Edmundson and Wyllys, 1961 [181], pp. 231-232.

2/ Doyle, 1959 [165], p. 11.


for  purposes  of  identifying  document  contents  and  to  use  data  on  the  joint  occurrence  of 
words  in  the  same  sentence  or  similar  contexts  as  grouping  criteria.    Clark  points  out  in 
particular  that  the  use  of  ordered  pairs  and  longer  sequences  of  words  to  express  a  single 
concept  may  be  highly  characteristic  of  the  special  technical  language  used  in  a  specific 
subject  field,  and  notably  those  of  the  social  sciences.  1/ 

Others who have explored word n-tuples as selection criteria for automatic extraction
operations include such investigators as Szemere, Levery, and Yakushin. Szemere
reports an investigation of 39 Swedish patent specifications in the field of
switching circuits, looking for significant word-pairs, with emphasis on noun-adjective
combinations (1962 [591]). The objectives of a project headed by Levery at IBM-France
have been reported as follows:


"A  series  of  experiments  is  planned  in  the  fields  of  automatic  indexing  of 
technical  texts  and  technical  vocabulary  analysis. 

"A  statistical  method  will  be  tested  to  determine  the  degree  of  closeness  in 
meaning  of  words.     The  method  will  consist  of  studying  the  pairs  of  words  which 
appear together in the majority of texts and calculating a coefficient of correlation
from the frequencies. Such work will result in a standard list of notions
frequencies  for  a  particular  kind  of  information. 

"Starting  from  this  list,  new  experiments  will  be  made  so  as  to  obtain  a  list 
of  keywords  representing  each  text.    The  method  will  use  statistical  comparison 
between  the  distribution  of  frequencies  of  notions  contained  in  a  text  and  the 
standard distributions obtained for the entire corpus." 2/

Yakushin (1963 [654]) develops a variation of the word-pair principle in which he
looks for those pairs where the words are, or suggest, names of objects, such as
"table-leg". He suggests, further, that so-called "basis nouns" can be established for
a given scientific field and entered into an inclusion dictionary, which also contains codes
for the lexical classes to which the word can belong and codes for determining whether or
not the word can join with another as a "basis term". Machine routines are then
suggested to determine whether or not given terms are jointly part of the same text, whether
one textually precedes another in a given text, and whether or not there is a "nomenclator"
pair. Depending upon the frequency of occurrence of identical or semantically related
nomenclator constructions, it is claimed that subject concepts can be detected. That is:

"The  method  is  founded  on  the  finding  in  a  text  of  so-called  basis  terms, 
established  by  list,  and  of  the  words  which  explain  them.    These  explanatory 
words,  which  in  different  contexts  refer  to  one  basis  term,  are  grouped  and 
ordered according to definite rules into a subject concept." 3/


1/ Clark, 1960 [123], p. 460.

2/ National Science Foundation's CR&D report no. 11 [430], p. 118.

3/ Yakushin, 1963 [654], p. 16.

3.3.3 Relative Frequency Techniques - Edmundson and Wyllys, and Others

The first comprehensive critique of word frequency approaches to automatic extracting
and indexing was undoubtedly that of Bar-Hillel (1959 [33], 1960 [34]), followed closely
by Edmundson and Wyllys (1961 [181]), who themselves have experimented with various
alternative or improved methods for obtaining measures of word significance by statistical
analysis. These critics have been in agreement both on many points of specific criticism
and on suggested possibilities for amelioration of observed difficulties, especially in
terms of considering relative word frequencies within a particular subject field. In
addition, several other investigators independently proposed a relative frequency approach
at about the same time. 1/

Some  typical  expressions  of  opinion  on  the  importance  of  relative  frequency  criteria 
are  as  follows: 

"Let  me  propose  here  a  system  of  auto-indexing  which,  to  my  knowledge,  has  never 
been  publicly  proposed  before  in  this  form  and  which  seems  to  me  superior  to  any 
other system I have heard of . . . Assume that . . . we are given a list of the average
relative frequencies of all English 'words' . . . It would then be possible, for any
given document, to rank-order all the 'words' occurring in this document according
to the excess of their relative frequency within the document over their average
relative frequency. By some mechanically implementable standard or other, an
initial segment of this list is selected as the index-set." 2/

"Very  general  considerations  from  information  theory  suggest  that  a  word's 
information  should  vary  inversely  with  its  frequency  rather  than  directly,  its 
lower  probability  evidencing  greater  selectivity  or  deliberation  in  its  use.    It  is 
the  rare,   special,  or  technical  word  that  will  indicate  most  strongly  the  subject 
of  an  author's  discussion.    Here,  however,  it  is  clear  that  by  'rare'  we  must 
mean  rare  in  general  usage,  not  rare  within  the  document  itself.    In  fact  it  would 
seem natural to regard the contrast between the word's relative frequency f
within the document and its relative frequency r in general use . . . as a more
revealing indication of the word's value in indicating the subject matter of a
document." 3/
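
The relative frequency proposals quoted above can be made concrete with a small sketch. It ranks the words of a document by the excess f - r of their relative frequency f within the document over a reference relative frequency r, and takes an initial segment of the ranking as the index set. The reference table, the choice of a simple difference as the measure of contrast, and the cut-off `top_n` are illustrative assumptions, not values given by the authors quoted.

```python
from collections import Counter

def rank_by_excess(document_words, reference_rel_freq, top_n=5):
    """Rank words by (relative frequency in document) minus (reference frequency).

    Words missing from the reference table are treated as having r = 0,
    which is an assumption made for this sketch.
    """
    counts = Counter(w.lower() for w in document_words)
    total = sum(counts.values())
    excess = {
        w: c / total - reference_rel_freq.get(w, 0.0)
        for w, c in counts.items()
    }
    # Highest excess first; ties are broken alphabetically for determinism.
    ranked = sorted(excess, key=lambda w: (-excess[w], w))
    return ranked[:top_n]
```

A word that is as common in the document as in general usage receives an excess near zero and falls to the bottom of the list, while a word that is rare in general usage but frequent locally rises to the top.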

1/ Compare, for example, Kochen, 1963 [327], p. 7: "The idea of contrasting words
which occur frequently in a document against the frequency of this word in the
background language for purposes of selecting index terms seem to have been
suggested first by Bohnert and the author, then described in more detail by
Edmundson and Wyllys, and tested empirically by Damerau. Something similar
was suggested even earlier by Bar-Hillel." See Bar-Hillel, 1962 [35], p. 418,
footnote, with respect to himself, Edmundson, and Bohnert. See also, however,
Doyle 1962 [163], p. 388: "Edmundson and Wyllys were probably the first to
publicly advocate contrasting word frequencies within a document to word
frequencies within a given field and using these relative frequencies as criteria for
scoring and selecting sentences."

2/ Bar-Hillel, 1959 [33], pp 4-8-9.

3/ Edmundson and Wyllys, 1961 [181], p. 227.

"We naturally find that the words of greatest interest are those for which there
exists the greatest contrast between general usage frequency and local (within the
article) usage frequency." 1/

"Luhn  has  bypassed  syntactical  analysis  by  taking  advantage  of  the  information 
content  of  the  most  frequently  used  topical  words  in  articles  .  .  .  Edmundson  et  al 
take  a  further  step  in  a  desirable  direction  by  bringing  in  information  from  outside 
the  article  being  analyzed:    words  and  terms  are  given  greater  topical  value  as  the 
contrast  increases  between  the  frequency  of  use  within  the  article  and  the  rarity  of 
general usage." 2/

"A further refinement of the process of automatic analysis would be the development
of special sets of reference frequencies for special fields of interest. This
would have two benefits: it would become possible to classify documents as to
field, and it would become possible to note the significance of words which are
frequent in the document and frequent in a very large reference class Cq of
literature (i.e., these words would not be significant with respect to Cq) but which
are rare in the special field. For example, the word 'emotion' might be too
common in general usage to seem significant, but frequent occurrence of the word
would stand out in a paper on electronic circuitry (e.g., of a robot) when compared
with its frequency in general electrical engineering literature." 3/

"One of the . . . goals is to investigate a relative-frequency approach to the
categorization of documents. . . For this investigation it will be necessary to develop
sets of reference frequencies for words used in different subject fields. It was
suggested by Edmundson and Wyllys that these sets of reference frequencies,
when developed, could be used to categorize a document as belonging to a particular
subject-field, by means of measuring the degree of matching (e.g., with the
chi-squared test) between the proportional frequencies of words in the documents and
the sets of reference frequencies." 4/
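
The suggestion of matching a document against field-specific reference frequencies can be sketched as follows. The example computes a chi-squared statistic between the observed word counts of a document and the counts expected under each field's reference proportions, and assigns the document to the field with the smallest statistic. It assumes that every field table covers the same small vocabulary with nonzero proportions summing to one; the field names and numbers in any usage are invented for illustration.

```python
from collections import Counter

def chi_squared(doc_counts, reference_rel_freq):
    """Chi-squared statistic of observed counts against reference proportions.

    Only words present in the reference table are considered, and every
    reference proportion is assumed to be greater than zero.
    """
    total = sum(doc_counts.get(w, 0) for w in reference_rel_freq)
    stat = 0.0
    for word, proportion in reference_rel_freq.items():
        expected = total * proportion
        observed = doc_counts.get(word, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

def categorize(words, field_references):
    """Return the field whose reference frequencies best match the document."""
    counts = Counter(words)
    return min(
        field_references,
        key=lambda field: chi_squared(counts, field_references[field]),
    )
```

A smaller statistic indicates a closer match, so the minimum over fields serves here as the "degree of matching"; a significance threshold could be added to leave poorly matching documents unassigned.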

Two  points  in  the  comments  quoted  above  appear  especially  worthy  of  note.    The  first 
is  that  of  introducing  at  least  some  measure  of  reference  to  material  other  than  the 
individual author's own choice of linguistic expression and specific terms. We shall
discuss this factor in more detail in a later section of this report. The second point,
derived  in  part  from  the  first,  is  the  specific  suggestion  of  movement  away  from  purely 
derivative  indexing  by  machine  in  the  direction  of  automatic  assignment  indexing  and 
automatic  categorization  or  classification. 


1/ Doyle, 1959 [165], p. 9.

2/ Doyle, 1961 [169], p. 3.

3/ Edmundson and Wyllys, 1961 [181], p. 228.

4/ Wyllys, 1963 [653], p. 10.

Actual experiments in application of relative frequency techniques to automatic
extracting processes have been pursued since 1959 by various investigators. Edmundson
and Wyllys and Damerau (1963 [148]) were certainly among the first. Edmundson and
Bohnert were engaged in experimental investigations at Planning Research Corporation
in 1959, 1/ and the following year Edmundson, Oswald, and Wyllys worked on the
auto-indexing and auto-extracting of the 40,000 words of text contained in nine articles in the
subject field of missilery. 2/ Wyllys has continued work on relative frequencies
(1963 [653]). At the System Development Corporation Doyle, in some of his work, has also
explored the relative frequency approach (1961 [161]). An example in Europe is work
reported by Meyer-Uhlenried and Lustig, where significant keywords from abstracts are
used not only as indexing terms directly, but by means of keyword lists and
micro-thesauri can also be used to assign documents to specific subject fields (1963 [417]).

3.3.4 Significant Word Distances

Another technique that has been investigated for the improvement of automatic
extraction operations based on the statistics of word frequencies is that of distances between
significant words. The desirability of attaching greater weight to n-tuples of immediately
adjacent words and to the co-occurrences of words within the same sentence has been
mentioned previously. Savage, in relatively early work developing some of the initial
proposals of Luhn, considered intra-sentence distances between significant words as
follows:

"...   The  criterion  is  the  relationship  of  the  high-frequency  words  to  each  other, 
rather  than  their  distribution  over  the  whole  sentence.     Consequently,   it  seems 
reasonable  to  consider  only  those  portions  of  sentences  which  are  bracketed  by 
high-frequency  words  and  to  set  a  limit  for  the  distance  at  which  any  two  such 
words  shall  be  considered  as  being  significantly  related  .  .  .  An  analysis  of  many 
sentences and many documents indicates that a useful limit is four or five
non-significant words between any two high-frequency words." 3/
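
The bracketing procedure described in this quotation lends itself to a short sketch. Given a tokenized sentence and a set of high-frequency (significant) words, it returns the portions of the sentence bracketed by significant words, starting a new portion whenever more than `max_gap` non-significant words intervene. The default of four follows the limit quoted; how the resulting portions are then scored is not specified here and is left out.

```python
def significant_clusters(tokens, significant, max_gap=4):
    """Return (start, end) index pairs of portions bracketed by significant words.

    Two significant words belong to the same portion when at most `max_gap`
    non-significant words lie between them.
    """
    clusters = []
    start = last = None
    for i, token in enumerate(tokens):
        if token not in significant:
            continue
        if start is None:
            start = last = i          # first significant word opens a portion
        elif i - last - 1 <= max_gap:
            last = i                  # close enough: extend the portion
        else:
            clusters.append((start, last))
            start = last = i          # too far: begin a new portion
    if start is not None:
        clusters.append((start, last))
    return clusters
```

A sentence score could then be derived from the densest portion, for example from the number of significant words it contains relative to its length, although that particular choice is an assumption rather than something stated in the passage.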

Doyle has also noted the tendency of words that are in fact highly related in a
content-revealing sense to co-occur in the same sentence or as quite direct neighbors. The same
investigator has also suggested that word distances can be used to provide "clustering"
effects that might, for example, sort out the possibly different topics covered in
introductory or background discussions, the main text, and various appendices. 4/


1/ National Science Foundation's CR&D Report No. 5 [430], p. 33; Bar-Hillel,
1962 [35], p. 418.

2/ National Science Foundation's CR&D Report No. 6 [430], pp. 43-44.

3/ Savage, 1958 [521], p. 4. Later related work has included a method for generating
auto-extracts which adds to the high-frequency word sentence scores a correction
factor for the number of words in gaps between such words. (See Rath et al, 1961
[493].)

4/ Doyle, 1961 [166], p. 7.

Related research efforts in more general areas of linguistic data processing suggest
inter-sentence distances as criteria for the selection of words and word groups in
automatic indexing and abstracting processes. In natural language text searching, for example,
the work of both Swanson (1960 [587], 1961 [586], 1963 [583]) and of Maron and Ray 1/
suggests that limitation of searching to a four-sentence span would eliminate a number of
irrelevant responses to search requests specifying the joint occurrence of two or more
words.

Swanson's findings indicated that if two words or phrases contained in the search
request were found in textual proximity within these limits, they were highly likely to bear
the semantic relationship intended by the requester. Applying the four-sentence
proximity criterion, it was found that the amount of irrelevant material retrieved
by the text searching system could be reduced by 60 percent without serious loss of
relevant information. 2/ Black cites the four-sentence proximity criterion and notes
further that it might be used also to retrieve only a paragraph or similar small portion of
the full text, reducing the amount of material to be read by the user, perhaps by as much
as 90 percent. 3/
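
A minimal sketch of the four-sentence proximity criterion follows. It reports whether two request terms occur within some window of `span` consecutive sentences. Reading "four-sentence span" as a window of four consecutive sentences, and matching terms against whitespace-separated lower-case tokens, are both assumptions of this example rather than details given in the reports cited.

```python
def cooccur_within_span(sentences, term_a, term_b, span=4):
    """True if both terms occur within some window of `span` consecutive sentences."""
    hits_a = [i for i, s in enumerate(sentences) if term_a in s.lower().split()]
    hits_b = [i for i, s in enumerate(sentences) if term_b in s.lower().split()]
    # Sentences i and j share a window of `span` sentences when |i - j| < span.
    return any(abs(i - j) < span for i in hits_a for j in hits_b)
```

A text searching routine would apply such a test to each candidate document and reject those in which the requested terms co-occur only at greater distances.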

Artandi,  in  her  book-indexing  studies,  suggested  as  a  topic  for  further  investigation 
the  possibility  that  proximity  of  index  term  candidates  as  derived  from  the  same  section 
of  the  text  could  serve  to  improve  the  quality  of  the  indexing.    Since  her  computer  program 
checks  for  duplicate  potential  entries  occurring  on  the  same  page,  this  feature  could  be 
used  for  further  analysis,   on  the  assumption  that  the  number  of  occurrences  of  the  same 
entry  for  the  same  page  is  an  indication  of  the  importance  of  the  discussion  of  the  subject 
on that page. 4/

3.3.5 Uses of Special Clues for Selection

Intra- and inter-sentence distances between words are relatively crude examples of
clues to selection of words and word-pairs which, because of their implied relationships,
may be especially significant for indexing, sentence extraction, or document categorization.
They can be quite readily detected by machine, but the implication that physical
proximity  is  a  good  measure  of  significant  co-occurrence  is  often  false.    Other  clues 
which  can  be  detected  equally  well,  mechanically,  are  those  which  have  to  do  with  position 
and  format. 


1/ Ray, 1961 [494], p. 92.

2/ Swanson, 1963 [583], p. 9; 1961 [586], pp. 298-299.

3/ See Black, 1963 [64], p. 20 and footnote: "The figure 90 percent is derived from
experience in previous experiments, wherein the amount of relevant material
was scanned and a subjective judgment was formed that the relevant material was
actually about 10 percent of the total verbiage retrieved. That is, about 10 percent
of each document contained the relevant material; 90 percent of the document was
of no relevance but the document as a whole was relevant."

4/ Artandi, 1963 [20], p. 47.

Such obvious positional clues as occurrences of words in titles, chapter or section
headings, figure captions, have already been mentioned. To these can be added first and
last sentences of paragraphs, 1/ or of first and last paragraphs as such. 2/ Wyllys
observes that other criteria which are detectable in the text by straightforward machine
procedures can be based on such features as italicization, capitalization, or punctuation.
He notes, however, that such "editorial" criteria vary from journal to journal so that
their usefulness would need to be related to the particular practices of individual
journals. 3/

Somewhat  more  difficult  for  machine  implementation,  but  certainly  feasible  in  the 
present  state  of  the  programming  art,  is  the  use  of  specific  semantic  or  syntactic  clues. 
Here  again,  Luhn,  Baxendale,  and  Edmundson  and  Wyllys  all  anticipate  their  critics  and 
later  investigators.    Luhn  recognized  the  fact  that  in  at  least  some  applications  the 
characterization  of  documents  by  isolated  words  alone  would  fail  to  provide  an  effective 
degree  of  discrimination.    He,  therefore,   suggested  operations  to  establish  word 
relationships,  whether  based  on  co-occurrences  or  combinations  of  specific  parts  of 
speech. 4/ Baxendale clearly uses both syntactic and semantic clues, detectable by
built-in  table  lookups. 

Representative  suggestions  by  Edmundson  or  Wyllys  or  both  as  co-authors  include 
the  following: 

"...  We  have  in  mind  a  glossary  or  dictionary  of  perhaps  one  to  two  thousand 
words  that  act  either  as  cue  words  which  signal  the  importance  of  a  sentence 
or  as  stigma  words  that  signal  the  insignificance  of  a  sentence  for  purposes  of 
abstracting." 5/


1/ See, for example, Wyllys, 1963 [653], p. 27: "One of the first published studies
in automatic document-content analysis, that of Miss Phyllis Baxendale, brought
out the importance of the first and last sentences in a paragraph as bearers of
a good deal of the content of the paragraph." See also Marthaler, 1963 [399],
p. 25.

2/ Compare Swanson, 1963 [580], p. 1: "...Some evidence exists to show that for
short homogeneous articles title and first paragraph are nearly as good as full
text."

3/ Wyllys, 1963 [653], p. 28.

4/ Luhn, 1959 [384], p. 5.

5/ Edmundson, 1962 [178], p. 11.

"The criteria for attributing significance to words . . . may be positional (in virtue of
their  occurrence  in  titles  or  section  headings),  or  semantic  (in  virtue  of  their 
relation  to  words  like  'summary'),  or  perhaps  even  pragmatic  (in  the  case  of  names 
of  specialists  mentioned  in  text  footnotes,  or  bibliography  .  .  . 

"A  cataloguer  or  abstract-writer  would  naturally  give  more  weight  to  a  technical 
word  that  appears  in  a  title,  in  a  first  paragraph,  or  in  a  summary.    A  machine 
can  be  programmed  to  do  the  same.    It  can  be  instructed  to  recognize  the  title  by 
position  and  capitalization  ...  It  can  place  first-paragraph  indications.  .  .    It  can 
test every heading or subtitle for the words 'summary' or 'conclusions' and place
a summary indication after each word in the summary paragraphs." 1/

"The  statistical  criteria  ...  by  no  means  exhaust  the  potential  clues  to  the 
representativeness  of  sentences.    Among  other  plausible  clues  are  certain  words 
and  phrases  .  .  .  authors  use  words  such  as  'conclusion',   'demonstrate',  'disclose', 
'prove',   'show',  and  'summary'  (and  related  forms  of  these)  with  high  frequency  in 
sentences  that  contain  concise  statements  about  the  topic  or  topics  of  the  article.  .  . 
The  occurrence  in  a  sentence  of  such  a  phrase  as  'it  was  found  that.  .  .  ',  'the 
experiment  proves.  .  .  ',  or  'the  central  problem  is  .  .  .  '  would  indicate  probably 
even  more  sharply  than  any  single  word  could  that  the  sentence  was  likely  to  be 
highly representative of the topics. . ." 2/
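
The cue-word and cue-phrase suggestions quoted above can be illustrated by a small scoring routine. The cue words and phrases are those named in the quotation from Wyllys; the stigma words and all of the weights are hypothetical placeholders, since the sources quoted give no examples or values for them. Related forms of the cue words (for instance "shows" or "proved") are not handled in this sketch.

```python
import re

CUE_WORDS = {"conclusion", "demonstrate", "disclose", "prove", "show", "summary"}
CUE_PHRASES = ("it was found that", "the experiment proves", "the central problem is")
STIGMA_WORDS = {"perhaps", "incidentally"}  # hypothetical examples only

def cue_score(sentence, cue_weight=1, phrase_weight=2, stigma_weight=1):
    """Score a sentence up for cue words and phrases and down for stigma words."""
    text = sentence.lower()
    words = re.findall(r"[a-z']+", text)
    score = cue_weight * sum(w in CUE_WORDS for w in words)
    score += phrase_weight * sum(p in text for p in CUE_PHRASES)
    score -= stigma_weight * sum(w in STIGMA_WORDS for w in words)
    return score
```

Giving phrases a larger weight than single words reflects the remark that a phrase "would indicate probably even more sharply than any single word could" that a sentence is representative; the factor of two is arbitrary.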

3.3.6 Recent Examples of Mixed Systems Experimentation

It  is  quite  obvious  from  the  above  samples  of  suggestions  for  the  use  of  various 
special  clues  for  automatic  extraction,  that  improved  systems  will  largely  depend  upon 
a mixture of means for determining subject-representativeness of words, phrases, and
sentences. Many of the clues suggested by Edmundson and Wyllys are continuing to be
explored, as mixed systems, at RAND 3/ and the System Development Corporation (1962
[590]), for example. Two specific recent examples of mixed systems experimentation
are the automatic abstracting experiment programs at Thompson Ramo-Wooldridge and
the  work  involving  detection  of  first  incidences  of  nouns  at  the  Harvard  Computation 
Laboratory. 

The TRW programs to investigate possibilities of computer generation of document
auto-abstracts, involving both English and Russian language texts, are based upon a
combination of four different methods to measure significance and determine
representativeness. These four methods are briefly described as follows:

". . . The Key method has as its source of machine recognizable clues the specific
characteristics of the body of the document and is based on a Key Glossary of
content words taken from the body of the document.

1/ Edmundson and Wyllys, 1961 [181], pp. 227 and 229.

2/ Wyllys, 1963 [653], p. 25.

3/ See National Science Foundation's CR&D report No. 11 [430], pp. 314-315.

". . . The Cue method has as its source of machine recognizable clues, the general
characteristics of the corpus that are provided by the bodies of the documents and
is based on a Cue Dictionary of function words apt to appear in the body of a
document.

". . . The Title method has as its source of machine recognizable clues, the specific
characteristics of the skeleton of the document, i.e., title, headings, and format,
and is based on a Title Glossary comprising those content words found in the
title, subtitles, and headings, but excluding certain words of the Cue Dictionary.

". . . The Location method has as its source of machine recognizable clues, the
general characteristics of the corpus that are provided by the skeletons of the
documents and uses a Heading Dictionary of certain function words that appear
in the skeletons of documents." 1/
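
One way to picture how four such methods might be combined is a weighted sum of four partial scores for each sentence. The sketch below is only that: the linear combination, the weights, and the structure of the glossaries and dictionaries (plain mappings from words to numeric weights, and a set for the Title Glossary) are assumptions made for illustration and are not described in the passage quoted.

```python
def mixed_score(words, heading_words, key_glossary, cue_dictionary,
                title_glossary, heading_dictionary, weights=(1, 1, 1, 1)):
    """Combine Key, Cue, Title, and Location evidence for one sentence.

    words          -- lower-case tokens of the sentence
    heading_words  -- tokens of the heading under which the sentence appears
    """
    key = sum(key_glossary.get(w, 0) for w in words)            # Key method
    cue = sum(cue_dictionary.get(w, 0) for w in words)          # Cue method
    title = sum(1 for w in words if w in title_glossary)        # Title method
    location = sum(heading_dictionary.get(h, 0) for h in heading_words)  # Location
    a_key, a_cue, a_title, a_location = weights
    return a_key * key + a_cue * cue + a_title * title + a_location * location
```

Setting a weight to zero switches the corresponding method off, which makes it easy to compare the contribution of each source of clues on the same text.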


The Harvard work involving detection of the first incidences of nouns as sentence
selection and indexing clues is part of a larger-scale program for mechanized
information selection and retrieval under the general direction of Salton (1961 [512], 1962 [513],
1963 [514] and [515]). The specific mixed system involving frequency data, syntactic
identification clues, and positional criteria is primarily the result of investigations by
Lesk and Storm (1961 [577], 1962 [358]). Related work takes advantage of computer
techniques for predictive syntactic analysis and automatic dictionary lookup also under
development at the Harvard Computation Laboratory (Kuno and Oettinger, 1963 [339],
[340], [341]).


The Lesk-Storm experiments have involved investigations where the hypothesis
assumed is that the points in a text where the author has first introduced a specific noun
or nominal phrase, or where he has used, with higher frequencies, a combination of
first-referred-to-nouns, are most likely to be especially indicative sections of text with
respect to subject-content representativeness. The assumption is, further, that areas in
which specific "new" ideas, not mentioned previously in the text, are first introduced are
particularly rich in topical-content concentration. 2/

The  mixed-system  emphasis  followed  by  Lesk  and  Storm,  however,  is  revealed  in 
the  following  comments: 


"It  is  not,  of  course,   apparent  that  a  count  of  initial  occurrences  of  nouns  ...   is  by 
itself  sufficient  to  reveal  areas  of  significant  information  content  for  purposes  of 
abstracting  or  indexing.    Accordingly,  the  method  suggested  here  must  be  used 
together  with  other  available  means,   and  is  not  expected  to  provide  by  itself  an 
acceptable abstracting algorithm." 3/

In  their  actual  investigations,   Lesk  and  Storm  first  made  manual  counts  of  initial 
noun  occurrences  in  various  sample  texts,  noting  paragraph,   sentence,   and  first 
incidence-of-word  identifications.     The  computer  was  then  used  to  carry  out  three 
distinctive  tasks:  (1)  calculation  of  the  number  of  new  nouns  for  each  sentence  in  the  text; 


1/ Thompson Ramo Wooldridge, 1963 [603], p. 1.

2/ Lesk and Storm, 1962 [358], p. 1-6.

3/ Storm, 1961 [577], pp. I-1 and I-2.


(2) computation of functions proportional to the number of initially occurring nouns for
each sentence, and (3) the preparation of a normalized graph for initial noun occurrences
by plotting the functional values against each sentence in the text. 1/ Sentence selection
can  then  proceed  by  processes  to  detect  "peaks"  on  the  graph,  using  a  relative  criterion 
or  weighting  function  to  minimize  the  effect  of  high  first-noun  counts  in  the  beginning 
sentences  of  a  paper. 

Trials were made with a number of different weighting formulas, and the best of these
involved moving averages of first-noun counts over several adjacent
sentences. A particular formula covering a span of seven sentences gave results that
appear to emphasize contextual effects and to reduce the effect of any single
sentence with a large number of new nouns, such as a listing of proper names. The
resulting abstracts are quite lengthy (e.g., comprising 20 percent or more of the original
text), and contain some relatively uninformative sentences. The investigators think that
the results with respect to satisfactory abstracting are inconclusive but provocative. They
also conclude that the possibilities for indexing are more immediately promising: "Most
key definitions are retained in the successful summaries, and the vocabulary reflects the
topics covered in the texts." 2/
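
The counting, smoothing, and peak-detection steps described above can be sketched in Python. This is only an illustration under stated assumptions: the report does not reproduce the weighting formula actually used by Lesk and Storm, so the centered moving average (truncated at the ends of the text), the simple local-maximum rule, and all function names are hypothetical. Each sentence is assumed to have been reduced already to a list of its nouns.

```python
def new_noun_counts(sentences):
    """For each sentence, count the nouns not seen in any earlier sentence."""
    seen = set()
    counts = []
    for nouns in sentences:
        fresh = set(nouns) - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts


def moving_average(values, span=7):
    """Centered moving average; the window is truncated at the text edges."""
    half = span // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(window) / len(window))
    return smoothed


def peaks(values):
    """Indices that rise above the left neighbor and do not fall below the right."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] >= values[i + 1]]
```

A caller would pass the output of `new_noun_counts` through `moving_average` and then select the sentences at the indices returned by `peaks`. The span of seven is the only parameter taken from the text.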

Other examples of mixed-system experimentation, especially involving the use of
syntactic and semantic considerations, include the work at the General Electric Computer
Department under Spangler, and work by Jacobson and Plath. In the Phoenix laboratories
of General Electric, a KWIC-type indexing program can be applied both to titles and to
running text, and a contemplated extension is intended to "generate indexes by means of
word analysis, taking into consideration syntactic and semantic aspects of text lines". 3/
Jacobson describes rules for machine determination of same-meaning occurrences of
words which may be homographic, and for selection of descriptors for indexing simple
paragraphs by choosing words occurring at least twice with a high probability of having the
same meaning. 4/ Plath reports:

"Although  sentences  occur  in  which  the  key  term  or  phrase  lies  buried 
deep  down  in  the  structure,  preliminary  observations  indicate  that  there 
are  many  others  in  which  the  semantic  hierarchy  closely  parallels  that 
of  the  syntactic  structure.  This  suggests  that  more  sensitive  vocabulary 
statistics  for  purposes  of  automatic  abstracting  may  be  obtainable  by 
considering only words occurring in positions above a predetermined cut-off
level in the sentence structure. Alternatively, one might count
occurrences of words on each level, and then multiply by a fixed
weighting factor in each instance before taking the overall totals." 5/
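
The two alternatives Plath mentions can be illustrated with a short Python sketch. It assumes that a syntactic analysis has already attached a depth to each word (0 for the top of the sentence structure, larger numbers for deeper levels); that representation, the function names, and the sample weights are hypothetical and are not taken from Plath's paper.

```python
from collections import Counter


def counts_above_cutoff(tagged_words, cutoff):
    """Count only words whose structural depth does not exceed the cutoff."""
    return Counter(word for word, depth in tagged_words if depth <= cutoff)


def level_weighted_counts(tagged_words, weights):
    """Add a fixed per-level weight for every occurrence of a word."""
    totals = Counter()
    for word, depth in tagged_words:
        totals[word] += weights.get(depth, 0.0)
    return totals
```

With `tagged_words` given as `(word, depth)` pairs, the first function implements the cut-off variant and the second the weighting variant. Levels missing from `weights` contribute nothing.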

1/

Lesk and Storm, 1962 [358], pp. 1-2. I-iff.

2/

Ibid, p. 1-31.

3/

National Science Foundation's CR&D Report No. 11, [430], p. 21.

4/

Jacobson, 1963 [292], pp. 191-192.

5/

Plath, 1962 [474], p. 190.

3.4 Quality of Modified Derivative Indexing by Machine

Most of the modified derivative indexing techniques that have been proposed to date
have few or no indexing results to provide comparative data for purposes of evaluation.
Moreover, those techniques which are primarily directed to the generation of document
abstracts rather than indexing terms have been reported to date with a paucity of actual
examples. 1/ One of the main reasons for this lack of product-effectiveness data is
unquestionably the high cost and difficulty of obtaining substantial corpora of representative
document text in machine-readable form. For the most part, the few examples of
automatic abstracts produced by machine are sadly lacking in pertinency, relevancy, 2/
and in continuity for scanning or reading by comparison with conventional human abstracts,
whether prepared by author, editor, volunteer specialist in the subject field, or
professional documentalist.

A few studies have been made for somewhat larger numbers of examples of
"auto-abstracts" with respect to differences between several different machine-extraction
formulas, random sentence selections, and sentences extracted manually. A project
conducted by IBM's Advanced Systems Development Division for the ACSI-matic program
(1960 [289], 1961 [290]) involved 70 to 90 articles on military intelligence items. The
comparisons were of "auto-abstracts" as against titles, full texts,
"pseudo-auto-abstracts" comprised of the first and last 5 percent of the sentences of each text, and
sets of sentences selected randomly, without reference to conventional types of manually
prepared abstracts and without respect to the quality as such. Similarly, Thompson
Ramo Wooldridge data (1963 [601]) on machine-extracted and randomly extracted
sentence sets compare these "abstracts" against manual selection of 25 percent of the
sentences of each item, rather than against a conventional type of abstract.

There are, however, almost no data available on the possible results of applying sentence
and word-group extracting techniques to machine-usable texts for the development
of indexing entries rather than for the generation of substitutes for document
abstracts. For this reason, as well as because discussion of the difficulties of evaluation
in general will be deferred to a later section of this report, the question of the quality of
modified derivative indexing will be briefly considered below, largely in terms of
non-quantitative judgments.

First and foremost, as has been noted previously, is the objection that word-indexing
typically produces redundancy, scatter of references among synonyms and near-synonyms,
inclusion of many irrelevant entries at high page and user-scanning costs, omission of


1/

Purto expresses regret that the studies of Agrayev and Borodin, intercomparing
results of human abstracting, use of Luhn's method, and their own modification,
used only a single paper (1962 [484]). Storm (1961 [577]), evaluating the initial
noun occurrence technique as a measure of sentence and index-term extraction
significance, reports results for only two papers, both by Quine. Only nine
articles, with no more than 40,000 words of text in toto, were used by Edmundson,
Oswald and Wyllys in their 1960 experiments ([180]).

2/ 

Compare, for example, Lesk and Storm, 1961 [358], pp. 1-29 and 1-30 as follows:
"A final problem is the ambiguity that may arise by removing two sentences from
context; two sentences alone do not always permit comprehension. Worse yet, the
meaning may actually be inverted upon removal from context. For example ... a
quote is selected which an unsuspecting reader might think the author supports,
when he is really attacking the position."
many properly indexable topics or points of interest because the authors did not emphasize
them or used new and unusual terminology to describe them, failures to achieve
consistency both of reference and index-vocabulary control for the papers of more than one
author, and the like.

Additional difficulties are engendered, for word indexing by machine from text as
against word indexing by people, because of complexities required in programming to
achieve recognition of even such simple indicia as endings of sentences, 1/ inconsistencies
of capitalization, 2/ and misspellings. 3/ Context distinctions between multiple
meanings of homographic words are even more difficult. Difficulties in achieving good
indexing quality are increased if only titles are used; those of keystroking and machine
cost requirements increase as the amount of input material grows.

For  these  reasons,  early  criticisms  such  as  those  of  Bar-Hillel  are  largely  as 
pertinent  today  as  they  were  when  statistical  techniques  for  computer  generation  of 
document  extracts  and  index  terms  were  first  proposed.    For  example: 

"There  can  be  no  doubt  but  that  computers  are  in  a  position  to  select  out  of  the 
words  or  word-strings  occurring  in  the  encoded  form  of  the  original  document 
those words or strings which fulfill certain formal, statistical conditions, such
as occurring more than five times, occurring with a relative frequency at least
double the relative frequency in general ... However, it is ... unlikely that the
set obtained thereby will be of a quality commensurate with that obtained by a
competent indexer. First, there will be serious difficulties as to what is to be
regarded as instances of the same word ... Second, there arises ... the problem
of synonyms. Third, and most important, this procedure will yield at its best a
set of words and word strings exclusively taken from the document itself." 4/

On  the  other  hand,  there  are  many  situations  where,  because  of  time  factors  or  lack 
of  conventional  indexing  resources,  even  unmodified  derivative  indexing  by  machine  is 
itself  of  value  and  therefore  modifications  to  improve  the  quality  of  results,  whether 
made by man or by machine, may be well worthwhile. As Anzlowar claims: "The
increasingly widespread KWIC indexes ... can save so much in time and effort that they
surely deserve better than the somewhat haphazard 'slash-dash-ing' now done
in most instances as the only cerebral operations thereon." 5/

1/

See Luhn, 1959 [384], p. 22: "Amongst the difficulties encountered in the processing
of machine readable texts, inconsistencies in the use of punctuation marks,
compounds, capitals, spacing and indentations have been a problem way out of
proportion with respect to the simple functions these devices stand for. For instance,
even with the aid of a dozen different tests performed by the machine, the true end
of a sentence cannot be determined with certainty."

2/

See Artandi, 1963 [20], pp. 52ff, on problems of capitalization of proper names.

3/

See Wyllys, 1963 [653], p. 15.

4/

Bar-Hillel, 1962 [35], pp. 417-418.

5/

Anzlowar, 1963 [16], p. 104.

Modifications to derivative indexing techniques that tend toward normalizations of
terminology and word usage, and increasingly sophisticated proposals for machine use
of syntactic, semantic, and contextual clues hold out the promise of transition to more
truly "subject" indexing and to automatic assignment indexing systems.

4.    AUTOMATIC  ASSIGNMENT  INDEXING  TECHNIQUES 

Whether indexing by machine is possible depends in part on whether what a machine
can achieve is properly termed "indexing" at all. If "indexing" is defined as being more
than the mere extraction of words from titles, abstracts, or text, then automatic
derivative indexing, even when augmented by various modifications, normalizations, and
editings, does not provide affirmative evidence. In the case of concept-oriented
definitions of indexing, the question becomes one of whether or not automatic assignment
indexing is possible. Experimental evidence suggesting that it is will be presented in this
section.

We should note first, however, that just as there are differences of opinion as to
what "indexing" means, so there are similar differences with respect to whether or not
it represents concepts rather than extracted words. There are also a number of
conflicting definitions of what is meant by "indexing" in contradistinction to "classifying". For
some, the latter difference is related to questions of the number of labels or surrogates
assigned to a single item to represent its subject contents, ranging from the assignment
of a single subject category in a classification scheme involving mutually exclusive
classes to the assignment of a number of terms or descriptors, each standing for one of a
number of aspects of the subject. For our purposes, however, we shall regard both the
case of indexing with a number of descriptors and that of classifying to a single category
or subject heading as being within the province of automatic assignment indexing,
reserving the term "automatic classification" for the case where the machine is used to
establish the classification or categorization scheme itself.

Actual experiments in automatic assignment indexing by Borko, Borko and Bernick,
Maron, Salton, Stevens and Urban, Swanson, and Williams will be discussed briefly
below. These discussions are generally in chronological order with respect to first
reporting of results, except that the Salton-Lesk-Storm work reflects a somewhat
different principle of assignment from the methods using clue word approaches and it is
therefore described after these others have been discussed. Some of the similarities and
differences between the various methods are then indicated. A brief final subsection
covers related assignment indexing proposals for which experimental data are not available
or have not as yet been reported in the literature.

4.1 Swanson and Later Work at Thompson Ramo-Wooldridge

Research on fully automatic indexing as well as on full text searching and retrieval
at the Ramo-Wooldridge Corporation has been reported as being under way at least as
early as the spring of 1958. 1/ As described elsewhere in this report, experiments in
search  and  retrieval  based  upon  full  natural  language  text  had  used  as  test  items  short 
articles  in  the  field  of  nuclear  physics.    In  additional  experiments  representing  a 
preliminary  "clue  word"  approach  to  possibilities  for  automatic  indexing  procedures, 
some  of  this  same  material  was  used. 

In these additional experiments, 27 articles in the nuclear physics subject area were
included in a corpus of 100 articles, the remainder covering a variety of topics.
Frequency counts of word occurrences for the physics material were obtained and the 12 most
frequent words that were judged to be discriminatory for the subject were selected. The
hypothesis was then tested that any document pertaining to nuclear physics would
contain at least two of these words. Retrieval was achieved for 25 of the 27 documents,
and the two "irrelevant" documents also retrieved did include information at least
peripherally related to the subject. It was thus evident that the retrieval effectiveness of
automatic recognition of nuclear physics subject material in the general collection was
considerably greater than the average effectiveness of retrieving responses to the highly
specific search questions in nuclear physics that had been used in the full text searching
experiments (Swanson, 1961 [586]).
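
The hypothesis tested here amounts to a simple threshold rule, which might be sketched in Python as follows. The twelve words actually selected are not listed in this report, so the clue words shown in the usage note are placeholders; the tokenization and the reading of "at least two of these words" as two distinct clue words are likewise assumptions.

```python
import re


def pertains_to_subject(text, clue_words, minimum=2):
    """True if the text contains at least `minimum` distinct clue words."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return len(words & set(clue_words)) >= minimum
```

For instance, with the hypothetical clue set `{"neutron", "proton", "nucleus"}`, a sentence mentioning both "neutron" and "proton" would be retrieved, while one mentioning only "neutron" would not.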

This second set of experiments provided a transition from the full text searching
work, which if it can be considered indexing at all is obviously derivative indexing, to
work in the application of an automatic assignment indexing method to 1,200 newspaper
clippings (Swanson, 1962 [584], 1963 [580]). These were brief news items for which
machine-readable texts in the form of punched paper tape were available.
Thesaurus-groups of words likely to be associated with each of 20 to 24 subject headings were first
compiled on the basis of human analysis of 1,000 or more representative items. These
word groups were further screened so that no word appeared in more than one group and
so that each word retained would be uniquely indicative of the particular subject
category. In the machine assignment procedure, subsequently, if a word occurs that
belongs to a particular thesaurus group, the corresponding subject heading is assigned
to the item in which that word occurs.
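
A minimal Python sketch of this lookup procedure is given below. The subject headings and words in the usage note are invented for illustration, and the function names are hypothetical; the only features taken from the description above are that no word may belong to more than one group and that a single occurrence of a group word suffices to assign the heading.

```python
import re


def build_lookup(groups):
    """Invert {heading: words} to {word: heading}; reject words in two groups."""
    lookup = {}
    for heading, words in groups.items():
        for word in words:
            if word in lookup:
                raise ValueError(f"{word!r} appears in more than one group")
            lookup[word] = heading
    return lookup


def assign_headings(text, lookup):
    """Assign every heading whose group contributes at least one word."""
    words = re.findall(r"[a-z]+", text.lower())
    return {lookup[w] for w in words if w in lookup}
```

Given hypothetical groups such as `{"sports": {"baseball", "inning"}, "finance": {"stock", "dividend"}}`, an item containing both "stock" and "inning" would receive both headings, which is consistent with items being assigned to several categories on the average.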

Results  achieved  with  this  technique  appear  to  be  highly  promising,  at  least  for  this 
type  of  material.    Swanson  reports  as  follows: 

"Approximately  1,200  brief  news  items  were  classified  into  20  nonhierarchical 
subject  categories,  both  by  a  human  and  a  machine  procedure.    Each  item  was 
assigned  on  the  average  to  about  four  categories.    The  results  of  the  two 
processes  were  compared.    With  the  human  process  as  a  standard,  the  machine 
missed  only  seven  percent  of  the  correct  subject  assignments  and  made  a  number 
of irrelevant assignments equal to about 17 percent of the total. Nearly 40 percent
of the automatic subject assignments judged finally to be correct were
missed by the human catalogers." 1/

While this accomplishment is actually due to the extensive human effort devoted to compiling,
organizing, and pruning the uniquely indicative word lists, it is pointed out that this
intellectual effort and the programming tasks need to be done only "once and for all". 2/
It is further pointed out that garbles or misspellings in the input text do not appear to
affect the procedure, there being enough redundancy in the messages so that even if one or
two clue words are missed, others will be present. 3/

Swanson and his TRW associates have further proposed extensions of the prespecified
unique clue-word technique. For example, it is suggested that machine processes of
comparing words of titles, subtitles and chapter headings to lists of possible subject
headings can be extended in sophistication by machine lookups of synonym groups and of
characteristic subject-word associations. 1/ Frequency weightings may be taken into
account, and similar measures of association and subject-indicativeness may be
developed for phrases as well as for individual words. 2/ In general, however, the
apparent success of this clue-word technique in tests to date should be considered in the
light of the special character of the items, their extreme brevity, and the high probability
that the fact-word incidence involved in news reporting is not typical of less popular and
less factually oriented materials.

Continuing work along similar lines has been carried forward at Ramo-Wooldridge in
the "Word Correlation and Automatic Indexing Program" sponsored by the Council on
Library Resources (1959 [490] and [491]). Here, the objectives are to develop and apply
clue-word techniques to material that is much more representative of the scientific and
technical literature. The thesaurus-groups, now called "indexonym" groups, are made up
of words and phrases selected by extensive human analysis as being significantly
"useful-for-retrieval-purposes".

New items would be processed in a word and phrase lookup operation, with each word
or phrase being initially assigned the identifier number codes of all groups to which it
belongs. However, unless a particular group's number is repeated several times within
the space of a few paragraphs, it is not used as the basis for the actual assignment of an
index tag. Provision would be made for calling human attention to items having a number
of words that are not deleted by processing against a "useless-for-retrieval purposes"
list, but that are not found in any of the "accepted" groups. It is suggested that in this way it
should be possible to "ascribe measures of automatically recognizable 'newness' to
technical articles". 3/
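
The proposed procedure can be sketched in Python under some loudly stated assumptions. The report gives neither the number of repetitions required nor the size of the "few paragraphs" window, so `min_repeats` and `window` below are hypothetical parameters, as are the function names and the single-word (rather than phrase) lookup.

```python
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def assign_tags(paragraphs, groups_of, min_repeats=3, window=2):
    """Tag an item with a group number only if that number recurs at least
    `min_repeats` times within some run of `window` consecutive paragraphs."""
    per_paragraph = [
        Counter(g for w in tokenize(p) for g in groups_of.get(w, ()))
        for p in paragraphs
    ]
    tags = set()
    for start in range(len(per_paragraph)):
        total = Counter()
        for counts in per_paragraph[start:start + window]:
            total.update(counts)
        tags |= {g for g, n in total.items() if n >= min_repeats}
    return tags


def unrecognized_words(paragraphs, groups_of, useless):
    """Words surviving the 'useless' list but found in no accepted group."""
    return {w for p in paragraphs for w in tokenize(p)
            if w not in useless and w not in groups_of}
```

Here `groups_of` maps each word to the identifier numbers of all groups to which it belongs. The size of the set returned by `unrecognized_words` is one conceivable basis for the "newness" measure mentioned in the text, although the report does not say how that measure would be computed.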

4.2  Maron's  Automatic  Indexing  Experiments 

By April of 1959, the reports of work at Thompson Ramo-Wooldridge on automatic
indexing and related problems submitted for the Current Research and Development in
Scientific Documentation series included reference to Maron and a "probabilistic model for
the assignment of index tags", as well as to Swanson's continuing projects. 4/

1/

Swanson,   1962  [584],  p.  469- 
Swanson,   1963  [  580],  pp.  1-2. 
See  also  Mooers,   1963  [424]  . 

In addition to his work on probabilistic indexing with emphasis on relevance
weightings for index tags manually assigned, Maron has actively explored automatic
assignment indexing techniques. The approach is also probabilistic, with emphasis on
the statistics of association between content-indicative clue words and subject headings
manually assigned to sample documents. The experimental corpus consisted of a group
of abstracts in the field of computer technology indexed to 32 subject categories designed
for the purposes of these investigations.

Common words such as articles and prepositions were first excluded. Next, words
occurring fewer than three times were purged, and words such as "data" and "computer"
were also rejected because they occur so frequently in this literature. Approximately
1,000 words remained after these purging operations. After sorting the source
documents to their most appropriate subject categories, statistical frequencies were
obtained for the co-occurrences of the candidate clue-words with the categories and the
resulting listings were manually examined to determine which words peaked in a
particular category. Eventually, 90 such words were selected.

The occurrence of one or more of the 90 clue-words in the text of new documents was
then used to predict the subject category to which the new item should belong. 1/ Tests
were run with two groups of documents, one consisting of the source items from which
the statistical frequency and word list data had been obtained, and the second group
consisting of 145 genuinely new items. For the latter group, twenty documents contained
no clue words whatever and forty items had only one. For the remaining 85 items having
two or more clue words, the results of the computer assignment program were
predictions of the correct category in 44, or 51.8 percent, of the cases. 2/ Results using the
source documents were significantly better, as expected, with 84.6 percent accuracy of
category prediction for 247 items. Results were also related to the number of clue words
that occurred in the test items, with a prediction accuracy of only 48.7 percent for items
with a single clue word rising to 100 percent probability of correct assignment if six or
more clue words occurred.
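
To make the general idea of probabilistic category prediction concrete, the following Python sketch uses a generic naive-Bayes style score. It is not Maron's formula, which this report does not reproduce: the add-one smoothing, the use of only those clue words that are present, the alphabetical tie-break, and every name in the code are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(samples, clue_words):
    """samples: iterable of (words_in_document, category) pairs."""
    samples = list(samples)
    category_counts = Counter(category for _, category in samples)
    word_counts = defaultdict(Counter)
    for words, category in samples:
        for word in set(words) & clue_words:
            word_counts[category][word] += 1
    return category_counts, word_counts


def predict(words, model, clue_words):
    """Return the highest-scoring category, or None if no clue word occurs."""
    category_counts, word_counts = model
    present = set(words) & clue_words
    if not present:
        return None
    total = sum(category_counts.values())
    best, best_score = None, -math.inf
    for category in sorted(category_counts):
        n = category_counts[category]
        score = math.log(n / total)  # prior from manually sorted source items
        for word in present:
            # smoothed share of this category's documents containing the word
            score += math.log((word_counts[category][word] + 1) / (n + 2))
        if score > best_score:
            best, best_score = category, score
    return best
```

Returning `None` when no clue word is present mirrors the twenty test documents for which no prediction could be based on clue words at all.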

Trachtenberg (1963 [608]) has also considered a probabilistic approach to automatic
indexing  and  categorization  of  documents,   similar  to  that  of  Maron.    He  suggests  the 
investigation  of  two  information  theoretic  measures  with  reference  to  determination  of 
which  of  various  possible  clue  words  are  significantly  discriminating  with  respect  to  the 
different  categories.    He  further  suggests  experiments  using  90  clue  words  and  the 
corpus  used  by  both  Maron  and  Borko,  but  no  actual  results  have  as  yet  been  reported. 

4.3 Automatic Indexing Investigations of Borko and Bernick

At the System Development Corporation, the work of Borko (1960 [73]), and of
Borko and Bernick (1962 [77], 1963 [78], 1964 [79]) in the area of automatic indexing
has involved both automatic assignment indexing and automatic classification techniques.
They have not only reported actual indexing results but have provided data for the
intercomparison of their techniques with the experiments of Maron for the same source
material.


1/

Note that the word itself is not necessarily used as an index tag or label, as is the
case for derivative indexing using an inclusion list approach. This is an important
distinction.

2/ 

Maron,   1961  [  395],  p.  257. 




The original Borko approach was based on the principles of factor analysis as these
had been developed for the analysis of multivariate data, especially in the field of
psychology. Borko's first experiments were directed to a corpus consisting of 618
abstracts in the field of psychology, amounting to approximately 50,000 words of total
text and 6,800 different words. These words were sorted by computer program into an
order reflecting their respective frequencies of occurrence. From the approximately 200
words that occurred twenty or more times in this corpus, the investigator himself
selected 90 words to serve as index (or, better, index-clue) terms. A matrix was then
developed for the frequencies of co-occurrence of these words and the documents in which
they appeared. From this, a 90 x 90 correlation matrix was computed as follows:

"To  compute  the  correlation  coefficient  .  .  .  we  used  the  following  formula 

$$ r = \frac{N \sum xy - (\sum x)(\sum y)}{\sqrt{[N \sum x^2 - (\sum x)^2]\,[N \sum y^2 - (\sum y)^2]}} $$

Where N is equal to the number of documents (618) and x and y are the terms being
correlated." 1/
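
The quoted formula is the ordinary product-moment correlation coefficient, and it can be computed directly from the rows of a term-by-document frequency matrix. The Python sketch below is illustrative only; the function names are hypothetical, and returning 0.0 for a term with constant frequency (where the coefficient is undefined) is a convention chosen here, not one stated by Borko.

```python
import math


def correlation(x, y):
    """Product-moment correlation of two equal-length frequency lists."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_xx = sum(a * a for a in x)
    sum_yy = sum(b * b for b in y)
    denominator = math.sqrt((n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2))
    if denominator == 0:
        return 0.0  # undefined for a constant row; a convention of this sketch
    return (n * sum_xy - sum_x * sum_y) / denominator


def correlation_matrix(term_rows):
    """term_rows[i][d] is the frequency of term i in document d."""
    return [[correlation(a, b) for b in term_rows] for a in term_rows]
```

With 90 term rows of 618 document frequencies each, `correlation_matrix` would yield the 90 x 90 matrix described in the text.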

The  term- correlation  matrix  was  then  factor  analyzed  and  the  first  ten  eigenvectors 
were  selected  as  factors  to  be  rotated  and  interpreted.     Borko  emphasizes  that: 

"The  interpretation  must  be  made  by  the  investigator  and  is  based  upon  his  knowledge 
of  the  analytic  procedures  and  the  subject  matter.    There  is,  therefore,  a  degree  of 
subjectivity in the names selected for each factor. These names may be regarded
as hypotheses about the factor meaning." 2/

Following the derivation of these "classification categories" by means of the factor
analysis technique, new items may be assigned to the categories on the basis of words
occurring  in  their  texts  (abstracts)  in  accordance  with  the  following  procedural  steps: 

"1.    Each  document,  in  machine  readable  form,  is  analyzed  by  the  computer. 
A list of the index terms and their frequencies of occurrence in each document
is  recorded. 

"2.  The  category  or  categories  containing  the  index  term  is  assigned  a  value  equal 
to  the  product  of  the  number  of  occurrences  of  the  word  in  the  abstract  and  the 
normalized  factor  loading  of  the  word  in  the  category.    If  more  than  one  index  term 
appears  in  a  category,  the  products  are  summed. 

"3. After each index term has been considered, the category having the highest
numerical value is selected." 3/
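
The three procedural steps quoted above can be sketched in Python as follows. The data layout (a dictionary of normalized factor loadings per category) and the function name are assumptions, and the loadings in the usage note are invented. The rule itself, summing occurrences times loading within each category and choosing the largest total, follows the quoted description.

```python
def assign_category(term_counts, loadings):
    """term_counts: {index term: occurrences in the abstract}.
    loadings: {category: {index term: normalized factor loading}}.
    Returns the best category and the full table of scores."""
    scores = {}
    for category, terms in loadings.items():
        scores[category] = sum(
            count * terms[term]
            for term, count in term_counts.items()
            if term in terms
        )
    best = max(scores, key=scores.get)
    return best, scores
```

For example, an abstract with two occurrences of "learning" and one of "perception", scored against hypothetical loadings `{"A": {"learning": 0.6}, "B": {"perception": 0.9, "learning": 0.1}}`, gives category A a value of 1.2 and category B a value of 1.1, so A is selected.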


The choice of 90 clue words in Borko's work with abstracts in the field of
psychological literature was apparently dictated by a matrix size which would be convenient
for computer manipulation. 1/ However, it happened to coincide with the number of clue
words used by Maron in his experiments. Advantage was taken of this coincidence to
obtain comparative data on the performance of the two assignment-indexing techniques
as applied to the same material. The 260 computer literature abstracts used by Maron
as source documents were processed to derive a correlation matrix for Maron's 90
manually selected words, which was then factor analyzed. Several sets of factors were
extracted, rotated, and the results studied, with a final selection of 21 categories.

Since  these  automatically  derived  categories  did  not  coincide  with  Maron's  original 
32,  it  was  necessary  to  analyze  manually  the  total  group  of  405  abstracts  (260  "source" 
and  145  "test"  items)  and  assign  them  to  the  new  categories,   then  to  study  the  documents 
falling  into  each  factor-analytically  derived  category  to  determine  which  of  Maron's  90 
clue  words  were  category-indicative,  and  finally  to  substitute  these  words  in  the  Bayesian 
equation  used  by  Maron  so  as  to  predict  which  of  these  classification  categories  his 
probabilistic  method  should  obtain. 

The same two sets of 260 "source" and 145 "new" abstracts used by Maron were then
submitted to the computer assignment program which compares the clue words of a new
item with the numeric values of the predictor words for each factor category, then
computes the score for each item in all categories, and assigns the category with the highest
score to the item. For the source items, Borko and Bernick's results showed 63.4
percent correctly classified, by comparison with the 84.6 percent correctness score
originally obtained for them in Maron's experiments. For the new items the factor
analysis method scored 48.9 percent correct assignment by comparison with Maron's
original 51.8 percent. 2/ The later investigators therefore concede that the performance
of Maron's technique was somewhat superior for the same items using the clue words
originally selected by Maron.

Further  experimentation  was  then  carried  out  (Borko  and  Bernick,   1963  [78])  using 
word  frequency  data  for  the  selection  of  a  new  set  of  90  clue  words  and  a  classification 
scheme  for  21  categories  was  again  automatically  derived.     The  405  abstracts  were  again 
manually  classified  to  these  machine-derived  categories  by  five  subject-matter 
specialists  and  the  two  investigators.    Comparative  data  were  then  obtained  for  both  the 
Maron  assignment  formula  and  the  modified  classification  system  assignments  in  terms 
of  agreement  with  the  manual  assignments. 

For the source items, the percentage of machine assignments agreeing with those
made by people was 62.7 when the Bayesian probability formula used by Maron was
applied and 61.2 for the factor analysis score system. For the new items, the
corresponding correct percentages were 57.9 and 55.9. Additional data compared the
effects  of  using  the  original  Maron  words  and  the  frequency-based  word  set  (Borko's 
words)  for  the  same  probability  formula  assignment  method.     While  there  was  an  overlap 
of  approximately  50  percent  between  Maron's  words  and  Borko's  words,  the  findings 
indicated  that: 


Now  increased  to  150  x  150. 
"... The index words selected by Maron are decidedly specific to the documents
from which they were derived and are of less generality than the frequency based
terms. The Bayesian formula coupled with the Maron words correctly predicted
the classification of 79.6% of the documents in Group I ['source items'] but only
45.5% of the documents in Group II ['test items']. The coupling of the Bayesian
formula with the Borko words resulted in a slight decrease in the percentage of
Group I documents whose classification was correctly predicted (62.7%) but
increased the percentage of correct prediction for Group II documents to 58.0%." 1/

Other findings from the later experiments indicated that despite the differences in
the two word-sets, the factor categories derived from them were very similar. It was
also found that, at least for the source items (Group I), the two machine techniques and
the manual process classified 56.1 percent of the items into the same categories. It
should be noted, however, that in the case of the automatic assignment methods: "Eleven
documents contained no clue words and could not be automatically classified by either
system." 2/

4.4  Williams' Discriminant  Analysis  Method 

The work of Williams in automatic assignment indexing, reported in the fall of
1963 [642], has also involved tests on abstracts of the computer literature, directly
comparable to but not necessarily identical with those used by Maron and by Borko and
Bernick. This work at IBM's Federal Systems Division, Bethesda, is based in part on
earlier work by Meadow which involved computer studies of matching functions for
document word lists and category word lists for test items drawn from such fields as
psychology, law, computer abstracts, and news items. 3/ What has subsequently been
developed is termed a "discriminant" method which begins with a hierarchical
classification structure of pre-established subject categories and with a small set of sample
documents previously indexed by people into these categories. Frequency counts of words
in each of the sample documents lead to computations, for each category, of the
theoretically probable frequencies of its most statistically significant words. For new items,
observed word frequencies are compared with the theoretical word-category associations
and a relevance value is computed for the item in terms of each category.

The corpus selected for experimentation consisted of 400 items from "Computer
Abstracts on Cards". 4/ These had previously been indexed using a classification
structure  of  15  major  categories,  each  of  which  is  divided  in  turn  into  10  subcategories. 
The  experimental  sample,  however,  was  so  selected  as  to  provide  exactly  15  "source" 
items  and  5  "new"  items  for  each  of  5  subdivisions  of  4  of  these  major  categories. 


1/ Borko and Bernick, 1963 [78], p. 23.

2/ Ibid, p. 11.

3/ Williams, 1963 [642], cites H. R. Meadow, "Statistical Analysis and Classification
of Documents", IRAD Task No. 0353, FSD IBM, Rockville, Maryland, 1962, but
this is apparently a company-confidential document, containing proprietary
information. Meadow gave an informal report on her work at the Computing Center
seminars, University of Maryland, in March of 1963.

4/ Available on a subscription basis from Cambridge Communications Corporation,
Cambridge, Mass.
Discriminant coefficients were then computed at both the major and minor levels for
all words occurring in the sample items falling into one of the 20 groups in accordance
with the formula:

"The discriminant coefficient is:


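\[ \lambda_i = \sum_{j=1}^{n} \frac{(p_{ij} - \bar{p}_i)^2}{\bar{p}_i} \]

Where:

\[ p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{m} f_{ij}} \]

The relative frequency of the ith word in the jth category,

and

\[ \bar{p}_i = \frac{1}{n} \sum_{j=1}^{n} p_{ij} \]

The mean relative frequency per category of the ith word." 1/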

These  coefficients  are  used  both  to  set  up  threshold  values  to  determine  which  words 
should  be  used  in  the  assignment  formulas  and  to  assign  weighting  factors  to  the  words 
themselves.
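
The two quantities defined in the quoted passage, the relative frequency of a word in a category and its mean relative frequency per category, can be computed directly from word counts. The sketch below is illustrative only. It assumes that the discriminant coefficient of a word is the sum, over categories, of the squared deviation of its relative frequency from the mean, divided by the mean; that form, the function name, and the data layout are assumptions of this example and are not taken from Williams' program.

```python
def discriminant_coefficients(freq):
    """freq maps category -> {word: count}; returns {word: coefficient}.

    Assumed form (an assumption of this sketch):
        coefficient_i = sum_j (p_ij - mean_i)**2 / mean_i
    where p_ij is the relative frequency of word i in category j and
    mean_i is the mean of p_ij over the n categories.
    Every category is assumed to contain at least one word.
    """
    categories = list(freq)
    n = len(categories)
    vocab = set()
    for counts in freq.values():
        vocab.update(counts)

    # Relative frequency of each word within each category.
    p = {}
    for j in categories:
        total = sum(freq[j].values())
        p[j] = {w: freq[j].get(w, 0) / total for w in vocab}

    coefficients = {}
    for w in vocab:
        mean = sum(p[j][w] for j in categories) / n
        coefficients[w] = sum((p[j][w] - mean) ** 2 for j in categories) / mean
    return coefficients


# Hypothetical counts: "the" is spread evenly, "compiler" is concentrated.
counts = {
    "A": {"compiler": 8, "the": 2},
    "B": {"circuit": 8, "the": 2},
}
coeffs = discriminant_coefficients(counts)
```

A word spread evenly over the categories, such as "the" here, receives a coefficient of zero, while a word concentrated in one category receives a large one. A threshold on the coefficient would therefore drop the evenly spread words, which matches the use of thresholds described in the text.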

The  results  of  the  experiments  to  date  are  based  on  83  items  from  the  "reference 
set"  which  were  not  used  as  source  items.    For  63  items,  78  percent  were  correctly 
classified at the level of a single major category (e.g., "Programming", "Hardware
Design") and also correctly classified at a single subcategory level (e.g., "Programming
Languages", "Semiconductor Devices"). The 20 remaining items were classified
to  one  major  category  with  an  accuracy  of  95  percent  and  to  two  minor  level  subdivisions 
with  accuracies  of  60  percent  and  75  percent.    Additional  investigations  were  made  on 
the  effects  of  using  a  discrimination  threshold  to  eliminate  insignificant  words  from 
consideration  and  on  the  use  of  weighting  factors  in  the  assignment  calculations. 

4.5  SADSACT

Stevens and Urban at the National Bureau of Standards (1963 [569, 570]) have also
explored an automatic indexing technique that uses, as in the experiments of Williams,
a teaching sample or reference set of previously indexed items to form patterns of word
and index-term assignment associations. However, there are much less formal
requirements for computing correlation coefficients and no consideration is required of either


1/ Williams, 1963 [642], p. 163.
the theoretical probabilities of word occurrence by category or of discrimination
coefficients and thresholds. Instead, the technique involves ad hoc statistical associations
between the words occurring in the title and in the abstract of a sample item and the
descriptors previously assigned to that item. A master selection-word vocabulary is
thus built up where each word is listed in terms of the frequencies of its co-occurrence
with each of the descriptors with which it has co-occurred, regardless of whether or not
such prior associations are either relevant or significant. No attempt has as yet been
made to "purge" the resulting association lists. Instead, reliance is placed on the
patterns of multiple word usage and of redundancy of words used in titles and cited titles
of new items to minimize the effects of irrelevant or accidental prior word-descriptor
associations and to enhance the significant ones.

The SADSACT method (for "Self Assigned Descriptors from Self and Cited Titles")
proceeds with the assumption, which it shares with the arguments for citation indexing
previously discussed, that the literature references cited by an author are indicative of
the subject content or contents of his paper. 1/ For the automatic indexing of new items,
their titles and the titles of up to ten bibliographic references cited are keystroked,
converted to punched cards, and fed to the computer. This input material is run against the
master vocabulary to obtain, for each input word which matches a vocabulary word, a
"descriptor-selection score" for each of the descriptors previously associated with that
word. These scores are summed up for all words and at an appropriate cutting level
those descriptors having the highest scores are assigned to the new item.
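
The two stages of the method, building the master vocabulary from a teaching sample and then scoring the words of a new item against it, can be sketched as below. This is a minimal illustration under stated assumptions, not the NBS program: the function names are hypothetical, the "descriptor-selection score" is taken to be simply the raw word-descriptor co-occurrence frequency, and the cutting level is represented as a fixed number of descriptors to keep.

```python
from collections import defaultdict


def build_vocabulary(teaching_sample):
    """teaching_sample: iterable of (words, descriptors) pairs.

    Returns word -> {descriptor: number of sample items in which the
    word and the descriptor co-occurred}. No associations are purged.
    """
    vocab = defaultdict(lambda: defaultdict(int))
    for words, descriptors in teaching_sample:
        for word in set(words):
            for descriptor in descriptors:
                vocab[word][descriptor] += 1
    return vocab


def assign_descriptors(input_words, vocab, cutoff):
    """Sum scores over all input words and keep the top `cutoff` descriptors.

    A word repeated in the title or cited titles is counted each time,
    so redundancy reinforces the associated descriptors.
    """
    scores = defaultdict(int)
    for word in input_words:
        for descriptor, freq in vocab.get(word, {}).items():
            scores[descriptor] += freq
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [descriptor for descriptor, _ in ranked[:cutoff]]


# Hypothetical teaching sample of three previously indexed items.
sample = [
    (["automatic", "indexing", "computer"], ["INDEXING", "COMPUTERS"]),
    (["computer", "logic"], ["COMPUTERS", "LOGIC"]),
    (["indexing", "retrieval"], ["INDEXING", "RETRIEVAL"]),
]
vocab = build_vocabulary(sample)
chosen = assign_descriptors(["indexing", "indexing", "retrieval", "unknown"], vocab, 2)
```

Words absent from the vocabulary contribute nothing, and an accidental association contributes only a small score unless several input words reinforce it.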

Preliminary  results  based  on  the  titles  and  cited  titles  of  items  that  were  "source 
items"  in  the  sense  that  their  titles  and  abstracts  had  been  used  in  the  teaching  sample 
were  reported  at  the  NATO  Advanced  Study  Institute  on  Automatic  Document  Analysis 
held  in  Venice  in  July,   1963.    For  30  items  drawn  from  such  subject  fields  as  computer 
technology,  information  selection  and  retrieval,  mathematical  logic,  pattern  recognition, 
and  operations  research,  all  of  which  had  previously  been  indexed  by  ASTIA  personnel  in 
1960, the machine assigned 64.8 percent of the descriptors previously assigned.
Subsequent tests on genuinely new items, however, resulted in a drop to only 48.2 percent
"hit"  accuracy. 

These  "new"  item  results  were  also  evaluated  by  having  several  representative 
users  of  the  collection  analyze  the  test  items  and  assign  descriptors  to  them  from  a  list 
of  the  descriptors  available  to  the  machine.     The  extent  to  which  the  descriptors  assigned 
by  machine  were  also  independently  chosen  by  one  or  more  of  these  indexers  was  then 
checked. In general, the fewer descriptors assigned by the machine, the better was the
human agreement, ranging from 47.4 percent overall in the case where the machine had
assigned twelve descriptors to each item to 76 percent agreement where the machine assigned
only one. In particular, for ten items which were analyzed by five different indexers,
the chances that one or more would also select the machine's first choice (highest scoring)
descriptor averaged 90 percent.

4.6  Assignment Indexing from Citation Data

Certain phases in the program of investigation of information selection and retrieval
problems at the Harvard Computation Laboratory have been mentioned previously. The
work of Storm and of Lesk and Storm on the use of first-noun-occurrences as selection
clues for both automatic indexing and abstracting was discussed in connection with
techniques for improved derivative indexing. The studies on citation indexing have included,
as noted, experiments to assign indexing terms to a new document by finding the indexing

1/ If necessary or desirable, however, abstracts or portions of text can be used in
addition to or in lieu of the cited titles.
terms previously assigned to the five most "related" documents, where "relatedness" is
a function of the similarity in citation patterns as between the new document and items
already in the collection. The results of such index term assignments are reported as
identical to those made by human judgment approximately 50 percent of the time. 1/

More specifically, in an experiment using documents drawn from a small collection
in the fields of mathematical linguistics and machine translation, a new item was
compared in terms of its citation data with the citation similarity data previously determined
for earlier documents, and the set of five related documents was selected using the
magnitude of the row similarity coefficients obtained from links of length one and two.
All index terms occurring at least twice in the set of terms assigned to these related
items were then assigned to the new items. For the ten "typical" new item cases, for
which comparative data are shown, the citation data assignment method correctly
assigned, on average, 47.6 percent of the terms assigned manually to the same items. 2/
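
The selection rule described above (take the five most related documents, then assign every index term that occurs at least twice among them) can be sketched as follows. The sketch takes the citation similarity coefficients as given input and does not attempt to reproduce how they were derived from citation links. The function names and data layout are assumptions made for illustration.

```python
from collections import Counter


def assign_from_related(similarity, index_terms, k=5, min_count=2):
    """Assign terms shared by at least `min_count` of the `k` most related documents.

    similarity:  document id -> similarity coefficient with the new item
                 (assumed to be precomputed from citation data).
    index_terms: document id -> terms already assigned to that document.
    """
    related = sorted(similarity, key=lambda d: (-similarity[d], d))[:k]
    counts = Counter(
        term for doc in related for term in set(index_terms.get(doc, ()))
    )
    return sorted(term for term, c in counts.items() if c >= min_count)


def fraction_recovered(assigned, manual):
    """Share of the manually assigned terms that the machine also assigned."""
    manual = set(manual)
    return len(manual & set(assigned)) / len(manual)


# Hypothetical similarity coefficients and prior indexing.
similarity = {"d1": 0.9, "d2": 0.8, "d3": 0.7, "d4": 0.6, "d5": 0.5, "d6": 0.1}
index_terms = {
    "d1": ["syntax", "translation"],
    "d2": ["translation", "dictionary"],
    "d3": ["syntax"],
    "d4": ["grammar"],
    "d5": ["dictionary", "statistics"],
    "d6": ["statistics", "grammar"],
}
assigned = assign_from_related(similarity, index_terms)
score = fraction_recovered(assigned, ["syntax", "translation", "semantics", "grammar"])
```

The second function corresponds to the evaluation measure quoted in the text: the proportion of manually assigned terms that the citation-based method also produced.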

A slightly more sophisticated indexing term assignment formula, described by Lesk,
was applied to additional test cases, but "failed to raise accuracy above fifty percent". 3/
For five typical new cases, the improved method correctly assigned 11 of the 20 terms
manually assigned to these items, or an average accuracy of 55.5 percent. 4/

4.7  Similarities and Distinctions among Assignment Indexing Experiments

In  Table  2  some  of  the  key  points  of  the  various  automatic  assignment  indexing 
experiments  we  have  discussed  above  are  summarized.    Certain  similarities,  distinctions, 
and  differences  are  to  be  noted.    Borko  and  Bernick  use  the  same  corpus  as  did  Maron 
and also re-apply Maron's formula to a different clue-word set for the same material.
Williams  uses  material  similar  to  the  Maron-Borko  computer  corpus.    The  SADSACT 
tests  also  use  some  items  that  might  be  included  in  the  Maron-Borko  and  Williams 
corpora.     The  Swanson  experiments  with  newspaper  clippings  represent  a  quite  different 
class  of  material  consisting  of  brief,  terse,  factual  messages. 


1/ Lesk, 1963 [357], p. V-8.

2/ Salton, 1962 [520], p. III-41, Table 9.

3/ Lesk, 1963 [357], p. V-7.

4/ Ibid, p. V-8, Table 3.
None of the experiments has so far encompassed testing of anything but very small
test item samples and the dangers of extrapolating from so small and so specialized
bodies of data should be clearly recognized. Mooers identifies these dangers in terms of

"The Silent Postulate:

That (real people), (real documents), (real jobs to do) can somehow
be eliminated from the experimental study, and that (role-playing people),
(substitute documents), (imaginary jobs) can be substituted and still give
valid experimental results." 1/

In  most  of  the  experiments  in  automatic  indexing  conducted  to  date,  indexing  and 
classification  schedules  have  been  especially  designed,  or  evaluations  made,  specifically 
for  the  purposes  of  these  tests.    Williams,  however,  stresses  the  point  that  the  material 
used in his experiments had been "classified by professional indexers for the purposes of
actual retrieval." 2/ A similar claim can be made for SADSACT, as noted by Mooers. 3/
Swanson's  news  item  work  also  obviously  relates  to  real  items  and  implies  a  real  job 
to  be  done,  but  is  directed,  as  noted,  to  a  class  of  material  not  generally  comparable  to 
that  found  in  documentation  operations  on  scientific  and  technical  literature. 

In  contrast  with  the  treatment  of  each  document  as  a  self-contained  entity  without 
reference  to  any  other  documents,  as  is  the  case  for  derivative  indexing,  all  of  the 
automatic assignment indexing experiments, by virtue of the fact that they are
assignment techniques, do to some extent embody the effects of a consensus of a particular
collection, or a consensus of prior indexing, or a consensus of human subject content
analysis applied to sample documents, or some combination of these effects. The
SADSACT method, in addition, wherever cited titles are available for new items, takes
advantage  of  terminology  other  than  the  author's  own  as  a  source  of  clue  words.  Other 
proposed  methods  of  assignment  indexing,  such  as  the  use  by  Salton,  Lesk,  and  Storm  of 
citation-pattern  similarity  data,  would  carry  the  latter  principle  even  further. 


1/ Mooers, 1963 [424], p. 5.

2/ Williams, 1963 [642], p. 162.

3/ Ibid, p. 5.
4.8  Other Assignment Indexing Proposals


A few additional automatic assignment indexing proposals are under development.
Examples for which experimental data are not as yet generally available include
work at EURATOM, some preliminary experiments at Chemical Abstracts
Service, work at General Electric, Bethesda, the proposed "Multilindex" system of
Information Systems, Inc., investigations by Slamecka and Zunde, and a special purpose
development project at Goodyear Aerospace.

Meyer-Uhlenried and Lustig report for the EURATOM developments as follows:

"... Procedures are being developed which allow based upon given keyword
lists first for abstracts: (a) to assign significant keywords and (b) based
upon hierarchically organized keyword lists, to assign the documents in
question to specific subject fields.

"Experiments were made at first on narrow fields with so-called
micro-thesauri, they showed encouraging results when automatic and manual
assignment were compared. Positive results depend of course on the quality of the
abstracts and the significance of the words employed in them. It remains to
see how far this favorable prognosis is confirmed by keyword collections of
more complex contents." 1/

Friedman and Dyson (1961 [203]) have reported on manual experiments designed to
relate  words  occurring  in  a  sample  of  abstracts  from  a  particular  section  of  Chemical 
Abstracts  to  the  title  or  heading  for  that  section.    Significant  words  in  these  abstracts 
were  counted  and  the  number  of  occurrences  as  well  as  the  number  of  different  abstracts 
in  which  they  appeared  were  determined,  with  a  rank  order  listing  as  a  result.  It 
appeared,  from  inspection,  that  it  should  be  feasible  to  develop,  for  each  CA  section,  a 
relatively small vocabulary of words that would be descriptive of, and indicative of, the
subject matter contained in it. They conclude: "In our opinion, the results were
significant, the small vocabulary of words did select a large percentage of the abstracts in the
section it was based on." 2/
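
The counting procedure described above, which was carried out manually, amounts to tallying for each word both its total occurrences and the number of different abstracts containing it, and then listing the words in rank order. A minimal sketch follows. The function name, the whitespace tokenization, the optional stop-word list, and the choice to rank first by number of abstracts are all assumptions made for illustration, since the source does not state the ranking key.

```python
from collections import Counter


def rank_words(abstracts, stop_words=frozenset()):
    """Return (word, occurrences, abstracts_containing) rows in rank order.

    Ranking by number of abstracts, then by occurrences, then
    alphabetically is an assumption of this sketch.
    """
    occurrences = Counter()
    doc_counts = Counter()
    for text in abstracts:
        words = [w for w in text.lower().split() if w not in stop_words]
        occurrences.update(words)       # every occurrence counts
        doc_counts.update(set(words))   # each abstract counts once per word
    return sorted(
        ((w, occurrences[w], doc_counts[w]) for w in occurrences),
        key=lambda row: (-row[2], -row[1], row[0]),
    )


# Hypothetical abstracts from one section.
section = ["polymer resin polymer", "resin catalyst", "polymer film"]
ranking = rank_words(section)
```

The words at the head of such a ranking are the candidates for the small section vocabulary that the authors judged feasible.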

A  project  at  Information  Systems  Operations,  General  Electric,  on  possibilities 
for automatic indexing and abstracting of text has been reported in the November 1962
issue of Current Research and Development. 3/ The META project (Methods of Extracting
Text Automatically) is said to be concerned with the use of statistical, linguistic, and
semantic criteria for analysis and selection of significant words and significant sentences
from  text.    Computer  programs  are  being  developed  in  modular  fashion  for  the  GE-225 
computer. 


1/ Meyer-Uhlenried and Lustig, 1963 [417], p. 229.

2/ Friedman and Dyson, 1961 [203], p. 10.

3/ National Science Foundation's CR&D report No. 11 [430], p. 97.




The proposed "Multilindex" system is also based on micro-thesauri or small
vocabularies designed, by human analysis, for clue-indications to a relatively narrow
subject field, together with potential syntactic-semantic role indications built into the
dictionary, again by extensive human analysis, following the approaches previously taken
by A. L. (Lukjanow) Loewenthal in her suggestions for solutions to problems of
mechanized translation. An unpublished proposal-type brochure describing the system was
available as of December 1963. 1/ As of that date, also, demonstration printouts were
available from an IBM 1401 Fortran program, illustrating an index compiled from
abstract-text input and a 1,200-word dictionary for documents in the field of space
antenna tracking radar. 2/ A repertoire of 350 "concepts" or indexing terms was involved,
with an average of 10 assigned to each of 22 test documents, many of these assigned terms being
identical to words occurring in either the title or the text of the abstract of the item.

Slamecka and Zunde have investigated the extent to which the "notations-of-content"
in the system developed by Documentation, Inc. for NASA's STAR might be derived by
machine techniques from the text of the abstracts with enough
normalization-standardization via inclusion dictionary lookup to qualify as an assignment indexing technique.
These workers claim:

"This preliminary investigation indicates the possibility of using the computer
to index documents adequately for machine retrieval by matching their abstracts
against an authoritative subject-heading authority . . . The inconsistency inherent in
human indexing can be eliminated as the number of terms derived from any one
abstract will always be the same. The abstract and its automatically derived set of
index terms will always be equivalent. . ." 3/
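
The idea of matching an abstract against an authority list through an inclusion dictionary can be illustrated in a few lines. The sketch below is not the Slamecka and Zunde procedure. The function name, the tokenization, and the sample dictionary entries are hypothetical. It shows only why the derived term set is fully repeatable: the same abstract and the same dictionary always yield the same headings.

```python
import re


def index_abstract(abstract, authority):
    """Return the standardized headings whose variant phrases occur in the abstract.

    authority maps a lower-case variant phrase -> standardized heading,
    so several textual variants normalize to a single index term.
    """
    tokens = re.findall(r"[a-z0-9]+", abstract.lower())
    padded = " " + " ".join(tokens) + " "
    return sorted(
        {heading for phrase, heading in authority.items() if f" {phrase} " in padded}
    )


# Hypothetical inclusion dictionary.
authority = {
    "tracking radar": "RADAR TRACKING",
    "radar tracking": "RADAR TRACKING",
    "antenna": "ANTENNAS",
    "antennas": "ANTENNAS",
}
terms = index_abstract("A space antenna for tracking radar. Antennas are compared.", authority)
```

Consistency of this kind says nothing about whether the derived terms are the ones a searcher would want; it guarantees only that repeated runs agree.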

A final example of other approaches to automatic assignment indexing research, not
yet reported in the open literature, is an NIH-sponsored project at Goodyear Aerospace, in
cooperation with the Universities of Minnesota and Rochester and Western Reserve
University, looking toward an automatic classification procedure based on word
co-occurrences for a set consisting of 100 four-to-five page documents in the field of diabetes
literature. Programs for statistical analyses of the full text of these documents, all of
which have previously been processed for the manual W.R.U. "telegraphic" abstracting
system, are being developed. 4/

5.    AUTOMATIC  CLASSIFICATION  AND  CATEGORIZATION 

In all the experimental work, to date, that has been directed toward the use of
computers and other machine-like techniques for the automatic indexing of documents, a


1/ "Description of MULTILINDEX. A mechanized system for indexing documents,
storing information, retrieving information", P. S. Shane, Dec. 4, 1963,
Information Systems, Inc., 7720 Wisconsin Avenue, Bethesda, Maryland.

2/ Private communications, A. L. Loewenthal and P. S. Shane, Dec. 11, 1963.

3/ Slamecka and Zunde, 1963 [561], pp. 139-140.

4/ E. Tuttle, private communication, Oct. 30, 1963.
dichotomy can be observed. There is, on the one hand, a spate of examples of automatic
derivative  indexing  where  words  used  by  the  author  himself  or  by  human  analysis  are 
sorted  and  arranged,  by  machine,   to  provide  index  listings,   announcement  bulletins,  and 
current  awareness  distribution  notices.     There  are  also,   on  the  other  hand,  at  least  a 
few  instances  of  investigations  where  the  machine  assigns  category  labels,  indexing 
terms,   or  "heads"  and  "headings"  from  a  classification  schedule,   to  new  items. 

In general, as Needham 1/ points out, proposed automatic assignment indexing
procedures can be investigated with reference to a previously existing index term vocabulary,
an existing classification system or schedule, or to specially designed vocabularies and
subject heading lists. On the other hand, if it is not known how well existing systems do
in fact characterize documents and if it is not known whether all pertinent properties of
the documents have been consistently identified, then it may be preferable to develop
methods for assigning documents to the appropriate class in a classification system which
is itself set up automatically. 2/ Needham also suggests still a third possibility: that of
setting up automatically a classification within which the subsequent classifying of
documents is done by hand.

The principal experimental results, to date, of attempts to achieve automatic
classification of documentary items, especially in the sense of machine-generated
groupings or categorizations of such items, have been those of applying techniques of
"clumping", 3/ factor analysis, and "latent class analysis". 4/ We shall briefly consider
below some typical investigations into automatic classification or categorization
procedures that have already had, or may have, applicability in automatic indexing techniques.

In  the  late  1950's,   Tanimoto  undertook  theoretical  studies  of  mathematical 
approaches  to  problems  of  classification  and  prediction  with  special  reference  to  matrix 
manipulations of sets of attributes of items to be classified. 5/ He also investigated

1/ Needham, 1963 [432], p. 1.

2/ Ibid, pp. 1-2: "If we are to assign a document to a class automatically, we must
have a) a list of facts about the classes which will make ascription possible:
b) an algorithm, usually some sort of matching algorithm, to tell us which class
best suits a document. Given a classification like the U.D.C., it is not at all
obvious that a) and b) exist, or even, if they can be found, a) and b) imply a degree
of uniformity about the classification which may just not be there."

3/ That is, the clustering of objects that are in some sense similar because they
share certain attributes or properties, even if, and especially when, the identity
of cluster-producing common properties is not known in advance.

4/ Compare Doyle, 1963 [162], p. 13: "There are other statistical techniques besides
factor analysis whose output is document clusters, such as latent class analysis
and clump theory, and there is a surprising increase in research in this kind of
analysis just within the last two years."

5/ Tanimoto, 1958 [593], 1961 [594]. See also Borko, 1963 [76], pp. 4-5: "In
1958, Tanimoto published a theoretical paper on the applications of mathematics to
the problems of classification and prediction. Specifically, he pointed out how the
problems of classification can be formulated in terms of sets of attributes and
manipulated as matrix functions."
theoretical aspects of automatic indexing and sentence extraction involving co-occurrences
of words. While Tanimoto's studies with respect to linguistic information processing for
classification purposes have apparently been limited to the theoretical considerations,
similar concepts of probabilistic, computational, and matrix manipulative operations to
derive and use coefficients of correlation of associations between such attributes as words
occurring in text or the index terms assigned to documents are involved in the factor
analysis and theory of clumps techniques as applied in actual experiments in documentary
classification.

5.1  Factor Analysis

The factor analysis technique which seeks to derive from word associations in
representative documents an automatically generated classification schedule for use in
actual indexing experiments has previously been mentioned. 1/ Reasons suggested for its
use in research at SDC have been reported as follows:

"The development of automatic procedures for purposes of classification and
abstracting requires the identification and specification of attributes of words or
passages so that the relevancy of topics or content can be determined.
Automatic procedures to detect such attributes may be based on a number of
characteristics of the text: word frequencies, syntactical information, semantic
information and pragmatic contextual clues. Currently, word frequency
information can be generated and manipulated by automatic procedures, whereas the
other attributes are not as readily handled this way. However, a correlation
matrix of content words becomes very unwieldy because of its size and the
complexity of relationships. For this reason, factor analysis is used to identify
clusters of relationships. Current work concentrates primarily on determining
the usefulness of factors identified in this way as classification and indexing
schemes." 2/

As noted above, Borko and Bernick (1961 [73], 1962 [77], 1963 [78]) have applied
this technique to abstracts drawn from psychological literature and to the same computer
literature abstracts as had been used by Maron (1961 [395]). This technique had also
been investigated in the studies looking toward information retrieval classification and
grouping undertaken at the Cambridge Language Research Unit from about 1957 onward.
However, certain apparent limitations of the factor analysis approach led Parker-Rhodes
and Needham to the alternative of the "theory of clumps" (1960 [465], 1961 [435, 464]).
Parker-Rhodes gives the rationale, and some of the distinctions between the two
techniques, as follows:

"It has been assumed that statistical methods could be applied to the data in such
a way as to reveal any objectively existing classes which may be there. The general


1/ Pp. 94-97 of this report.

2/ System Development Corporation, 1962 [590], p. 15.
name for the techniques evolved in this way is factor analysis. Insofar as it
is practically applicable this technique has worked well enough; but. . . it has two
limitations (a) that some classification problems are outside its scope, and
(b) that it is not susceptible (at least as hitherto conceived) of adaptation
computationally to the study of really large universes. . ." 1/

".  .  .  The  procedure  of  factor  analysis  first  finds  certain  clumps,  but  then,  as 
output,  it  gives  us  vectors  relating  the  descriptors  of  the  universe  to  the 
clumps  found.  .  . 

"In  most  cases,  factor  analysis  is  used  (especially  in  psychology)  to  debug  the 
descriptor  space;  more  conventionally  put,  to  eliminate  those  tests  (descriptors) 
which  have  an  equivocal  membership  in  several  factors  (Clumps)  in  favor  of 
those  which,  having  more  definite  allegiances,   convey  more  information  of  the 
kind  which  the  analysis  suggests  as  valuable.    It  is  thus  only  related  to  the 
classification  of  the  universe  at  one  remove;  the  classification  it  suggests  is  a 
simple  categorical  classification  defined  by  the  descriptors  suggested  as  the 
most  valuable.  .  . 

"The  descriptive  array  of  a  universe  is  a  table  giving  the  applicability  or 
inapplicability  of  each  descriptor  to  each  element.    To  classify  the  elements 
of  the  universe,     we  calculate  for  every  pair  of  elements  a  similarity  as  a 
function  of  the  corresponding  rows  of  the  descriptive  array,  and  then  regard 
the  similarity  matrix  as  a  sufficient  description  of  the  universe.      In  factor 
analysis,  on  the  contrary,  we  start  with  the  matrix  of  correlations  between 
the  descriptors,  each  being  a  function  of  a  pair  of  columns  of  the  descriptive 
array. . ." 2/
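
The contrast drawn in the last passage quoted can be illustrated by a minimal sketch in Python. The small descriptive array, the function names, and the particular similarity function (descriptors shared by two elements divided by descriptors held by either) are assumptions made for illustration only; the quoted passage does not fix a particular function.

```python
from math import sqrt

# Hypothetical descriptive array: rows are elements (e.g., documents),
# columns are descriptors; 1 = descriptor applies, 0 = it does not.
ARRAY = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]


def row_similarity(a, b):
    """Assumed measure: shared descriptors / descriptors held by either element."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def column_correlation(u, v):
    """Pearson correlation between two descriptor columns."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    dev_u = sqrt(sum((x - mean_u) ** 2 for x in u))
    dev_v = sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (dev_u * dev_v) if dev_u and dev_v else 0.0


def similarity_matrix(array):
    """Element-by-element matrix computed from pairs of ROWS (clump approach)."""
    return [[row_similarity(a, b) for b in array] for a in array]


def correlation_matrix(array):
    """Descriptor-by-descriptor matrix computed from pairs of COLUMNS (factor analysis)."""
    columns = list(zip(*array))
    return [[column_correlation(u, v) for v in columns] for u in columns]
```

Here `similarity_matrix` has one row per element, the starting point assumed in clump finding, whereas `correlation_matrix` has one row per descriptor, the starting point of a factor analysis.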

Other  investigators  who  have  considered  factor  analysis  techniques  for  possible 
applications  to  automatic  indexing,  automatic  categorization  of  items  in  a  collection  of 
items,  or  search  prescription  renegotiation  in  a  mechanized  selection  and  retrieval 
system include Stiles (1962 [573]), Doyle (1963 [162]), and Hammond (1962 [251]).

Stiles, whose principal experimental results relate rather to the use of statistical
associations between terms manually assigned to documents for search prescription
formulation and renegotiation than to automatic indexing procedures as such, 3/ has also
considered both automatic indexing and automatic classification approaches. Specifically,
he has made at least preliminary investigations of the factor analysis technique
independently developed for similar purposes by Borko. For a large collection of
105,000 items, the statistics of co-occurrence of indexing terms were in some cases not
as precise as desired because the same terms were used in different senses for different
items in the collection.


1/ Note that Borko himself confirms this limitation as recently as November 1963,
in stating, of the CLRU work on clumps: "However, even now these techniques
have been applied to a 346x346 matrix which is beyond the capabilities of presently
available factor analysis programs." (1963 [76], p. 8).

2/ Parker-Rhodes, 1961 [464], pp. 3-6.

3/ This principal concern is discussed below with reference to potentially
related research, pp. 119-122 of this report.
The possibilities of using factor analysis to sort out the different meanings were
therefore explored. 1/ Using an IBM 704 program, the centroid method of factor analysis
was  applied  to  a  matrix  of  correlation  coefficients  of  terms  that  had  co-occurred  signifi- 
cantly with  the  term  "exposure".     Three  factors  were  derived,  one  generally  relating  to 
the  corrosive  effects  of  exposure,  another  to  "exposure"  in  the  sense  of  photographic 
exposure,  and  the  third  dealing  with  both  exposure-to-weather  and  exposure-to-radiation. 
Although  the  results  were  considered  quite  satisfactory,  more  extensive  experimentation 
and  use  did  not  appear  feasible  because  of  computer  matrix  manipulation  limitations. 


Doyle notes, in particular, that factor analysis might be used to give well-defined
clusters separated one from another by clear boundaries rather than the less precise
clusters found by most document grouping techniques. He emphasizes, however, that
"its success in doing so of course, depends on the well-defined clusters actually being
present in the data". 2/ He suggests that a combination of factor analysis and human
editing  to  select  items  most  typical  of  statistically  derived  categories  could  be  valuable 
in  such  applications  as  the  sorting  of  Congressional  mail  or  the  identification  of  trends 
in  political  or  military  intelligence  materials  free  from  the  personal  biases  of  an  analyst. 

Hammond and his Datatrol associates, who have worked on an application of the
Stiles association factor technique for search question negotiation to legal literature, have
also considered the potentialities of factor analysis. Thus they report:

". . . The present association factor gives the relationship of one term to another.
A factor analysis study would allow us to determine the relationship of a single
term to a group of terms. From this we could learn how terms cluster when
related to the same concept." 3/


5.2 The Theory of Clumps


It  is  assumed,  in  the  work  on  the  theory  of  clumps,  that  we  have  a  population  of 
objects  or  items  among  which  at  least  some  classes  or  groupings  do  objectively  exist, 
but  that  we  do  not  have  any  bases  for  precisely  determining  class  membership  require- 
ments.   There  may,  therefore,  be  many  possible  ways  of  grouping  and  many  possible 
definitions  of  clumps.    On  the  other  hand,   such  diverse  definitions  must  conform  to  the 
extent  of  some  similarities  of  membership  in  the  clumps  that  they  define  if  in  fact  they 
do  define  any  of  the  existing  classes.    Assuming  further  that  we  are  given  information 
about  properties  ascribable  to  various  members  of  the  population,  it  is  theorized  that 
useful  clumps  can  be  discovered  by  investigating  similarity  connections  between  pairs 
of  items,   such  as  the  number  of  co-occurrences  of  specific  properties.    Thereafter,  only 
these  similarity  connections  are  considered,  and  the  connection  matrix  is  used  as  the 
basis  for  trial  partitions  of  the  population  into  various  possible  subsets. 
In early work on clump definition, Kuhns of Ramo-Wooldridge 1/ proposed the use
of  a  threshold  value  such  that  if  a  subset  is  a  clump  every  pair  of  members  in  it  has  a 
connection  strength  equal  to  or  greater  than  the  threshold  value  and  no  member  of  the 
subset's  complement  has  connections  of  more  than  threshold  value  to  the  members  of  the 
subset.    In  the  more  extensive  investigations  carried  out  by  Parker-Rhodes  and  Needham 
(1960 [465], 1961 [434, 435, 464]), other clump definitions have been explored and
specifically that of the "GR-Clump". This is defined as a subset of the universe such
that all its members have a positive (or zero) bias to the subset and all non-members
have  a  negative  bias  to  it,  where  bias  is  defined  as  the  excess  (positive  or  negative)  of  the 
total  connections  of  a  member  of  the  population  to  the  members  of  the  subset  over  its 
total  connections  to  the  members  of  the  subset's  complement,  following  the  convention 
that  the  connection  of  the  element  to  itself  is  taken  as  zero. 
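
The threshold definition attributed to Kuhns at the start of this paragraph can be stated as a short test. The following Python sketch is illustrative only: the connection matrix `CONN`, the function name, and the representation of the universe as integer indices are assumptions, not details taken from the cited work.

```python
# Hypothetical symmetric connection matrix for a universe of five elements.
CONN = [
    [0, 3, 2, 0, 0],
    [3, 0, 3, 1, 0],
    [2, 3, 0, 0, 0],
    [0, 1, 0, 0, 4],
    [0, 0, 0, 4, 0],
]


def is_threshold_clump(conn, subset, threshold):
    """Return True if `subset` satisfies the threshold definition of a clump.

    Every pair of members must be connected with strength >= threshold, and
    no element of the complement may be connected to any member with
    strength greater than the threshold.
    """
    members = sorted(subset)
    pairs_ok = all(
        conn[a][b] >= threshold
        for i, a in enumerate(members)
        for b in members[i + 1:]
    )
    outsiders_ok = all(
        conn[outsider][member] <= threshold
        for outsider in range(len(conn))
        if outsider not in subset
        for member in members
    )
    return pairs_ok and outsiders_ok
```

With this matrix, the subset {0, 1, 2} passes at a threshold of 2 but fails at a threshold of 3, since elements 0 and 2 are connected with strength 2 only.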

An  iterative  procedure  for  discovering  GR-clumps  can  now  be  followed.     This  is 
based  on  an  arbitrary  initial  partition  of  the  given  universe  of  elements  into  a  subset  and 
its  complement.     Then,   since    each  element  has  a  bias  toward  both  the  subset  and  its 
complement,  differing  only  in  sign,  the  biases  of  each  element  are  computed.    If  the  bias 
of  a  particular  element  is  positive  with  respect  to  the  subset,  it  is  transferred  to  the  sub- 
set if  it  is  not  already  a  member  of  it,  and  conversely  if  its  bias  is  negative,  it  is  trans- 
ferred to  the  subset's  complement  if  it  is  not  already  there.    Each  time  a  transfer  is 
made,  the  biases  are  recomputed  and  the  process  is  repeated  until  for  a  complete  scan  of 
all elements no further transfers can be made. The result is a GR-clump even though it
may  have  no  members  or  may  contain  all  the  elements  of  the  universe.    In  such  case,  a 
further  partition  is  made  and  the  procedures  are  re-applied. 
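
The iterative procedure just described can be sketched as follows. This is a minimal Python illustration under stated assumptions: the connection matrix is taken to be symmetric with non-negative entries, the names `bias`, `find_gr_clump`, and `CONN` are hypothetical, and elements with zero bias are simply left where they are. It is not a reconstruction of the CLRU programs.

```python
# Hypothetical symmetric connection matrix for a universe of five elements.
CONN = [
    [0, 3, 2, 0, 0],
    [3, 0, 3, 1, 0],
    [2, 3, 0, 0, 0],
    [0, 1, 0, 0, 4],
    [0, 0, 0, 4, 0],
]


def bias(conn, element, subset):
    """Total connection to the subset minus total connection to its complement."""
    inside = outside = 0
    for other, strength in enumerate(conn[element]):
        if other == element:
            continue  # convention: connection of an element to itself is zero
        if other in subset:
            inside += strength
        else:
            outside += strength
    return inside - outside


def find_gr_clump(conn, initial):
    """Transfer elements until a complete scan produces no further transfer."""
    subset = set(initial)
    changed = True
    while changed:
        changed = False
        for element in range(len(conn)):
            # The bias is computed afresh here, so every earlier transfer
            # in the scan is already taken into account.
            b = bias(conn, element, subset)
            if b > 0 and element not in subset:
                subset.add(element)
                changed = True
            elif b < 0 and element in subset:
                subset.remove(element)
                changed = True
    return subset
```

Starting from the arbitrary partition {0, 1, 3}, the scan settles on {0, 1, 2}. Starting from {0} alone, it ends with the empty set, which is the degenerate outcome for which the text prescribes a further partition.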

These  GR-clump  finding  procedures  have  been  applied  to  such  diverse  collections 
of  items  to  be  classified  as  archaeological  artefacts  and  patients'  symptoms  as  related 
to  specific  disease  diagnosis.    In  the  latter  case,  groupings  were  obtained  that  corre- 
sponded satisfactorily  to  certain  specific  disease  syndromes,  but  no  group  was  found 
corresponding  to  Hodgkin's  disease  where  a  great  variety  of  symptoms  typically  occur. 
Needham  comments:    "I  can  scarcely  conceive  of  a  clump  definition  that  would  be  likely 
to  group  these  patients;  I  am  unsure  whether  this  is  a  reflection  on  clump  theory  or  on 
Hodgkin's disease." 2/

In  applications  more  directly  related  to  documentation,   some  investigations  have 
been  made  of  the  use  of  co-occurrence  coefficients  of  index  terms  assigned  to  documents 
in  order  to  form  a  connection  matrix  from  which  clumps  were  then  derived  (Needham, 
1963 [431]). These experiments covered 342 terms occurring more than once in the
index-term  sets  assigned  to  several  hundred  documents  in  the  general  subject   field  of 
machine  translation.    Computation  of  the  matrix  required  20  minutes  of  computer  time 
and the 40 clumps found took 6-8 minutes each to find. Needham reports on the results
as follows:
1/ See Kuhns, 1959 [336], and Needham, 1961 [435], pp. 20-21.
"Evaluation of the results was unexpectedly difficult. The acid test is presumably
the efficiency of the retrieval system embodying the grouping given by the program;
but the efficiency of retrieval systems cannot be easily measured. An apparently
simpler test would be to see if the clumps were intuitively satisfactory, i.e., were
groupings that a classifier in his right mind could have made. This also was
unsatisfactory because the groups are mostly rather large, larger in fact than
classifiers ordinarily make, and were thus very difficult to judge. The test
eventually adopted was to group the terms not distinguished by the clump
classification, and look at these. Accordingly, for each term, a list of the clumps to
which it belongs was prepared, and groups of terms were found which had all
their clumps in common. These groups were quite small (2-6 terms) and could
be studied easily. It turned out that some groups were ones of which a human
classifier could have thought (e.g., words concerning suffix removal for machine
translation came together) while others were quite justified by the documents
concerned, but would never have been thought of a priori. For example, the group:
"phrase marker, phoneme, Markov process, terminal language" was entirely
justified by the . . . contents of the library. It is groups of the latter kind that
represent a success for clump theory, for they function usefully in retrieval but
in no way form part of the structure of thought . . . which the human classifier's work
is likely to reflect." 1/
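
The test Needham describes, grouping together the terms that have all their clumps in common, amounts to a simple inversion of the clump lists. The sketch below is one possible rendering in Python; the function name and the sample clumps are hypothetical and are not data from the experiment.

```python
from collections import defaultdict

# Hypothetical clumps, each given as a set of index terms.
CLUMPS = [
    {"suffix", "stem", "phoneme", "parser"},
    {"suffix", "stem", "dictionary"},
    {"phoneme", "parser", "dictionary"},
]


def indistinguishable_groups(clumps):
    """Return groups of two or more terms that belong to exactly the same clumps."""
    # For each term, record the set of clump numbers in which it appears.
    membership = defaultdict(set)
    for number, clump in enumerate(clumps):
        for term in clump:
            membership[term].add(number)
    # Terms with identical membership sets are not distinguished by the clumps.
    groups = defaultdict(list)
    for term, numbers in membership.items():
        groups[frozenset(numbers)].append(term)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)
```

For the sample clumps this yields the groups "parser, phoneme" and "stem, suffix"; the term "dictionary" shares its pattern of membership with no other term and so forms no group.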

Still another application of the theory of clumps may be of use in the construction of
thesauri (Sparck-Jones, 1962 [564]). Here the assumption is that rows of a correlation
matrix  can  be  formed  for  words  giving  other  words  which  are  synonymous  with  respect  to 
meaning.     The  overlaps  of  the  same  word's  occurrence  in  two  or  more  rows  can  then  be 
used  to  find  clumps  which  are  presumed  to  represent  conceptual  groupings. 

Applications  of  clump  theory  to  problems  of  mechanized  documentation  are  also 
being  investigated  by  Dale  and  Dale  of  the  Linguistics  Research  Center,  the  University  of 
Texas. 2/ They have begun experimentation to derive clumps for the 90 clue words used
by  Borko  and  the  260  source-item  computer  abstracts  used  by  both  Maron  and  Borko. 
Preliminary  results  reported  so  far  are  principally  limited  to  considerations  of  the  asso- 
ciative networks  between  terms  as  derived  from  the  structure  of  the  clumps  discovered 
by  several  clump  definitions.    Mention  should  also  be  made  of  the  work  of  Meetham  and 
Vaswani  at  the  National  Physical  Laboratory,  Teddington,  England,  looking  toward  the 
use  of  similar  techniques  for  machine-generated  index  vocabularies,  with  preliminary 
emphasis  on  testing  them  against  a  "library"  consisting  of  the  propositions  of  Euclid's 
geometry. 3/
1/ Needham, 1963 [431], pp. 285-286.

2/ Dale and Dale, an unpublished report dated February 1964 [147].

3/ National Science Foundation's CR&D report No. 11 [430], p. 137; and Meetham,
1963 [413].
5.3 Latent Class Analysis

Like the earlier work of Tanimoto, the latent class analysis approach of Baker (1962
[27]) to problems of automatic information classification and retrieval is at least to date
theoretical  rather  than  experimental  in  nature,   and  so  will  be  considered  only  briefly  here. 
Baker  claims  that  the  latent  class  model  developed  in  the  field  of  the  sociological  sciences 
for  the  determination  of  latent  classes  among  individuals  responding  "yes"  or  "no"  to 
items  in  a  questionnaire  would  have  attractive  features  for  application  to  information 
categorization  and  search,  because  the  model  is  based  upon  response  patterns  that  are 
analogous  to  the  presence  or  absence  of  clue  words  or  phrases  in  documents  and  because 
the  analysis  yields  an  ordering  ratio  that  could  serve  a  function  similar  to  the  relevance 
weightings  suggested  by  Maron  and  Kuhns. 

This  ordering  ratio  is  the  probability  that  a  given  pattern  of  clue  words  will  occur 
in  a  document  properly  belonging  to  a  particular  latent  class.     The  probabilities  of  the 
same  pattern  being  generated  by  a  document  properly  belonging  to  other  classes  are  also 
provided,   giving  an  uncertainty  which  Baker  thinks  justifiable  because  a  "document  could 
generate  a  given  pattern  of  key  words,  yet  not  belong  to  the  same  area  of  interest  as  the 
majority of documents possessing the same pattern of keywords". 1/ It should be noted,
however, that the question of how to select appropriate clue words is begged 2/ and that
no computer programs are as yet available for carrying out latent class analyses. 3/
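
The probabilities referred to above can be illustrated under the usual latent class assumption that, within a class, the clue words occur independently of one another. The following Python sketch rests on that assumption; the class sizes, the word probabilities, and the function names are hypothetical, and the normalization shown in `class_posteriors` is a conventional step that may or may not correspond to Baker's ordering ratio as he defines it.

```python
def pattern_probability(pattern, word_probs):
    """Probability of a presence/absence pattern within one latent class.

    pattern    -- sequence of 1 (clue word present) or 0 (absent)
    word_probs -- probability of each clue word being present in that class
    Assumes the clue words are independent within the class.
    """
    probability = 1.0
    for present, p in zip(pattern, word_probs):
        probability *= p if present else (1.0 - p)
    return probability


def class_posteriors(pattern, class_sizes, class_word_probs):
    """Relative likelihood of each class for a document showing `pattern`."""
    joint = [
        size * pattern_probability(pattern, probs)
        for size, probs in zip(class_sizes, class_word_probs)
    ]
    total = sum(joint)
    return [value / total for value in joint] if total else [0.0] * len(joint)
```

For two equally large classes in which the first clue word is present with probability 0.8 and 0.2 respectively, and the second with probability 0.2 and 0.8, the pattern "first word present, second absent" has probability 0.64 in the first class and 0.04 in the second. The second figure is the kind of residual uncertainty that Baker regards as justifiable.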

5.4 Examples of Other Proposed Classificatory Techniques

There are certain other document classificatory techniques that have been proposed
and to some extent investigated experimentally. Trials of document clusterings based
on co-citingness, co-citedness, or bibliographic coupling as compared with subject
content groupings have, as noted above, been conducted both by Kessler at the M.I.T.
Libraries and by Salton's group at Harvard. 4/ Consideration of Doyle's work on word
co-occurrence  statistics  has  been  deliberately  deferred  to  a  later  section  which  covers 
his  general  "association  map"  approach.    Similarly,   several  other  investigations  will  be 
discussed  in  terms  of  potentially  related  research  such  as  linguistic  data  processing. 

Two particular examples of other suggested classificatory techniques for document
grouping or classification are somewhat unusual, however. These are the methods
proposed by Te Nuyl and by Lefkovitz (1963 [353]). Cleverdon and Mills comment on Te
Nuyl's method as follows:
1/ Baker, 1962 [27], p. 518.

2/ Ibid, p. 517. Note also that the footnote states: "A referee of this paper has properly
cautioned that the effectiveness of an information retrieval system may be due
more to the appropriateness of the key words than the subsequent processing." See
also Hillman, 1963 [272], p. 323: "Baker's theory, however, is based on
interrelationships of key words, and thus constitutes an approach which is regarded with
some suspicion by Farradane, who thinks that the real problem concerns the
interrelationships of the concepts which key words denote."

3/ Baker, 1962 [27], p. 516.

4/ See Kessler, 1963 [320]; Lesk, 1963 [356, 357], and p. 30 of this report.


"Te Nuyl . . . uses, as quasi-descriptors, word-sets chosen from the Oxford English
Dictionary (e.g., any word falling between A-Ah) and relies on the subsequent
correlation of terms to make sense of his seemingly bizarre choice." 1/

Lefkovitz is concerned with the so-called "automatic stratification" of a file in
which both generic or associative relationships and exclusive partitioning are used to
facilitate search. He claims:

". . . The exclusive partitioning implies a separation of descriptors into groups
such that no two descriptors in a group co-occur in any given document description
of the file. This arrangement presents the dissociative properties of the file, or
forbidden combinations. When coupled with a superimposed display of the
'inclusive' or associative properties of the file a unique classification of the
descriptors of this file results, which is based solely upon the association of the
descriptors themselves within the document descriptions and not upon an arbitrary
set of classes constructed by professional indexers." 2/

The  purpose  is  to  assist  the  searcher  by  warning  him  that  if  he  chooses  more  than 
one  descriptor  from  any  one  group  as  terms  in  his  search  request,  there  will  be  a  null 
response from this particular file. However, the particular application considered
involves a limited number of highly quantifiable or scalable "attribute-value" pairs (for
so the descriptors involved are defined), such as "Age-23" and "Hair-red". It is by
no  means  obvious  that  comparable  exclusive  partitionings  could  be  achieved  for  literature 
items  or  that  the  recomputations  necessary  as  new  items  enter  the  file  can  be  achieved 
on  a  practical  basis. 
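
The exclusive partitioning and the warning to the searcher can be illustrated by a short Python sketch. The greedy grouping used here is an assumption chosen for brevity and is not claimed to be Lefkovitz's procedure; the function names and the sample file are likewise hypothetical. In general more than one valid partition may exist, and the greedy pass simply returns one of them.

```python
from itertools import combinations

# Hypothetical file of document descriptions built from attribute-value pairs.
FILE = [
    {"Age-23", "Hair-red"},
    {"Age-23", "Hair-brown"},
    {"Age-40", "Hair-red"},
]


def cooccurring_pairs(descriptions):
    """All descriptor pairs that appear together in at least one description."""
    pairs = set()
    for description in descriptions:
        pairs.update(combinations(sorted(description), 2))
    return pairs


def exclusive_partition(descriptions):
    """Greedy grouping in which no two descriptors of a group ever co-occur."""
    pairs = cooccurring_pairs(descriptions)
    groups = []
    for descriptor in sorted(set().union(*descriptions)):
        for group in groups:
            if all(tuple(sorted((descriptor, member))) not in pairs for member in group):
                group.append(descriptor)
                break
        else:
            groups.append([descriptor])
    return groups


def null_response_groups(request, groups):
    """Groups from which the request draws more than one descriptor."""
    return [group for group in groups if len(set(request) & set(group)) > 1]
```

For the sample file the groups are "Age-23, Age-40" and "Hair-brown, Hair-red". A request naming both ages would be flagged as certain to give a null response, while a request for "Age-23" and "Hair-red" would not. The recomputation required whenever a new item enters the file is visible in the sketch: the pair set and the groups must both be rebuilt.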

6.    OTHER  POTENTIALLY  RELATED  RESEARCH 

In  this  section  we  shall  consider  certain  areas  of  potentially  related  research  that 
may  prove  applicable  to  the  improvement  of  automatic  indexing  techniques.    First  is  the 
area  of  thesaurus  construction  and  use,  which  in  turn  is  somewhat  related  to  the  develop- 
ment of  statistical  association  techniques,  especially  for  "indexing-at-time-of-search" 
and  search  renegotiations.    Natural  language  text  searching  will  also  be  briefly 
considered,  together  with  related  research  in  the  general  area  of  linguistic  data 
processing. 

6.1 Thesaurus Construction, Use, and Up-Dating

The  first  area  of  potentially  related  research  which  promises  improvements  in 
automatic  indexing  procedures  is  that  of  thesaurus  lookups  by  machine.    There  are 
several  different  possible  definitions  of  the  word  "thesaurus"  in  the  context  of  informa- 
tion storage,  selection  and  retrieval  systems.    The  first  is  that  it  is  a  prescriptive 
indexing  aid,   or  authority  list,   serving  the  function  of  normalizing  the  indexing  language, 
primarily  by  the  use  of  a  single  word  form  for  words  occurring  in  various  inflections,  by 
the  reduction  of  synonyms,  and  by  the  introduction  of  appropriate  syndetic  devices.  The 
second  definition  relates  to  the  intended  function  for  the  provocation  and  suggestion  to 
the  indexer  or  the  searcher  of  additional  terms  and  clues,  and  it  follows  the  idea  of  word 
groupings  related  to  concepts  as  in  a  traditional  thesaurus  like  Roget's.     The  third 
possible definition involves the special case of devices or techniques which display or use
prior associations and co-occurrences of words, indexing terms, and related documents
to  provide  a  guide  or  suggestive  indexing  and  search-prescription-formulation  or 
renegotiation  aid. 
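
The first, prescriptive definition can be pictured as a table lookup. The following Python sketch assumes a hypothetical authority list that maps inflected forms and synonyms to a single preferred term; the entries and the function name are illustrative only.

```python
# Hypothetical authority list: variant form or synonym -> preferred indexing term.
AUTHORITY = {
    "indexes": "index",
    "indices": "index",
    "indexing": "index",
    "automobile": "car",
    "automobiles": "car",
    "cars": "car",
}


def normalize_terms(words, authority):
    """Return (normalized terms, words not covered by the authority list)."""
    preferred_terms = set(authority.values())
    normalized, unmatched = [], []
    for word in words:
        key = word.lower()
        if key in authority:
            preferred = authority[key]
        else:
            preferred = key
            if key not in preferred_terms:
                unmatched.append(key)  # candidate for later amendment of the list
        if preferred not in normalized:
            normalized.append(preferred)
    return normalized, unmatched
```

Applied to the words "Indices, automobile, cars, index, thesaurus", the lookup yields the terms "index, car, thesaurus" and reports "thesaurus" as unmatched.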

The idea of a mechanized authority list, following the restrictive first definition,
has been proposed by a number of investigators 1/ and has actually been used in computer
programs as discussed, for example, by Schultz and Shepherd (1960 [532]), Shepherd (1963
[545]) and Artandi (1963 [20]). It is the second definition of thesaurus with which we
shall be principally concerned. It is, as we have said, close to the conventional idea of
such a thesaurus as Roget's. It is based on the hypothesis that patterns of co-occurrences
of words in a new item or in a search request can be compared with patterns of prior
co-occurrences, as given by a thesaurus "head", in order to expand, clarify, or pin-point
"meaning" and thus provide a more effective indication of the true subject content. The
third definition will be considered as falling within the more general scope of statistical
association techniques, although as Giuliano points out, "a retrieval system embodying
an automatic thesaurus thus qualifies as being 'associative'." 2/

The application of a thesaurus-like approach to indexing and searching problems is
again  an  area  in  which  Luhn  is  one  of  the  earliest  proponents.    In  January  1953,  he 
proposed  a  new  method  of  recording  and  searching  information  in  which  a  special  diction- 
ary would  be  compiled  for  use  in  broadening  the  terms  of  a  search  request  and  in 
normalizing  word  usage  as  between  various  indexers  (recorders)  and  searchers.  Al- 
though he  did  not  then  use  the  term  "Thesaurus"  as  such,  he  said  in  part: 

"The  process  of  broadening  the  concept  involves  the  compilation  of  a  dictionary 
wherein  key  terms  of  desired  broadness  may  be  found  to  replace  unduly  specific 
terms,  the  latter  being  treated  as  synonyms  of  a  higher  order  than  ordinarily 
1/ See, for example, "Summary of discussions, Area 5," ICSI, 1959 [578],
p. 1263: "Two further complications arise from a mechanical index.
Some articles might deserve as an indexing term a word not contained
in the article. By an authority list, the product of the mechanized indexing
procedure might have such additional words added to it. Again, an article
might use a particular word but the vocabulary of the system might prefer
another one. This also can be handled by a mechanized authority list."
considered. Translating criteria into these key terms is a process of normalization
which will eliminate many disagreements in the choice of specific terms amongst
recorders, amongst inquirers, and amongst the two groups, by merging the terms
at issue into a single key term. However, the dictionary does not classify or index
but maintains the idea of being fields . . . A specific term may appear under the
heading of several key terms and if according to its application an overlapping of
concepts exists then the term is represented by the several key terms
involved. . ." 1/

In subsequent papers, Luhn has developed related ideas of a "family of notions" and
"dictionaries of notional families". 2/ In particular, he emphasizes that for automatic
indexing, by contrast with automatic abstracting, consideration should be given to the
normalization of variations in author-chosen terminology: "It will be necessary for a
machine to resolve variation of word usage with the aid of a device the functions of which
resemble a dictionary at one level and of a thesaurus at another level of requirements." 3/

The first issue of the National Science Foundation's compendium of project
statements, "Current Research and Development in Scientific Documentation", which appeared
in July 1957 [430], reported several projects of interest in terms of thesaurus
construction and use, 4/ namely: (1) work by Luhn at IBM involving the establishment of a
thesaurus to facilitate encoding of items whose texts would be available in machine-usable
form, (2) work by Bernier and Heumann at Chemical Abstracts Service looking toward the
development of a technical thesaurus (1957 [57]), and (3) an approach to mechanized
translation proposing to use a mechanized thesaurus at the Cambridge Language Research
Unit. This latter project incorporated the ideas of Masterman and her associates from
about 1956 on (Halliday, 1956 [249]; Masterman, 1956 [403]; Joyce and Needham, 1958
[305]), to apply the principle of checking co-occurrences of text words against thesaurus
"heads" to which they belonged, in order to resolve homographic ambiguities and thus
achieve more idiomatic translation by machine.
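
The principle of checking co-occurring text words against thesaurus "heads" can be sketched as a simple vote. In the Python illustration below, the table of heads, the example words, and the function name are all hypothetical; the rule of choosing the head shared with the greatest number of context words is an assumption adopted for the sketch and is not taken from the CLRU papers.

```python
from collections import Counter

# Hypothetical thesaurus: each word is listed under one or more heads.
HEADS = {
    "bank": ["FINANCE", "RIVER"],
    "loan": ["FINANCE"],
    "interest": ["FINANCE", "CURIOSITY"],
    "water": ["RIVER", "LIQUID"],
}


def pick_head(word, context, heads_of):
    """Choose the head of `word` that is shared by the most context words."""
    votes = Counter()
    for other in context:
        if other == word:
            continue
        for head in heads_of.get(other, ()):
            votes[head] += 1
    candidates = heads_of.get(word, [])
    if not candidates:
        return None  # word not in the thesaurus
    # Ties are resolved in favor of the head listed first for the word.
    return max(candidates, key=lambda head: votes[head])
```

In the context "loan, bank, interest" the homograph "bank" is assigned to the head FINANCE; in the context "water, bank" it is assigned to RIVER.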

For the ICSI Conference in 1958, Masterman, Needham and Sparck-Jones prepared
a paper discussing analogies between machine translation and information retrieval, and
recapitulated the arguments of Needham and Joyce for the use of a thesaurus in the
formulation of search requests, as follows:

"If a large number of terms are used to describe a document, the existence of
synonyms is likely: in a system such as Uniterm no attempt is made to bracket
the synonyms, which means that a request will produce only the document described
1/ Luhn, 1953 [383], p. 15.

2/ Luhn, 1959 [371], p. 51; 1959 [384]; 1957 [385], p. 316.

3/ Luhn, 1959 [384], p. 12.

4/ National Science Foundation's CR&D Report No. 1 [430], pp. 21, 6, 4.
in identical terms and not in synonymous ones. If the existence of synonyms
is avoided, by using a small number of exclusive descriptors, the description
of a document in terms useful for retrieval is more difficult, also it is equally
difficult to relate a request to the description of documents. A further difficulty
is that descriptions only list the main terms, and take no account of their relations
to one another. The C.L.R.U. experiments being carried out make use of a
thesaurus, a procedure through which it is hoped that these difficulties will be
avoided and that a request for a document although not using the same terms as
those in the document will produce that document and others dealing with the
same problem, but described in different, though synonymous, terms." 1/

In  general,  the  use  of  a  thesaurus  to  constrain  variations  in  word  or  term  usage 
(as  in  our  first  definition,  a  mechanized  authority  list),  to  reduce  synonymity,  to  resolve 
homographic  ambiguity,  to  provoke  and  suggest  additional  terms  or  ideas  to  indexer  and 
to  searcher  alike,  is  related  to  the  improvement  of  automatic  indexing  procedures  in 
precisely the same sense that its use would be effective in any indexing system
whatsoever. In another sense, however, the construction and use of the thesaurus is
related to linguistic data processing by machine. Garvin suggests:

". . . One may reasonably expect to arrive at a semantic classification of the
content-bearing elements of a language which is inductively inferred from the study of
text, rather than superimposed from some viewpoint external to the structure of the
language. Such a classification can be expected to yield more reliable answers to
the problems of synonymy and content representation than the existing thesauri
and synonym lists, which are based mainly on intuitively perceived similarities
without adequate empirical controls." 2/

This is with respect to the recognition that the machine itself can be used to compile
and construct the thesaurus. While Luhn in some of his 1957-8 proposals still considered
the compilation and organization of a thesaurus to be primarily a matter of human effort,
he nevertheless pointed out that: "The statistical material that may be required in the
manual compilation of dictionaries and thesauri may be derived from the original texts
in any desired form and degree of detail." 3/ De Grolier makes the complementary
statement that the Luhn techniques should "considerably facilitate" the preparation of
thesauri. 4/

Even  more  importantly,  the  computer  can  be  used  for  periodic  up-datings  and 
revisions.    The  work  on  the  FASEB  index-term  normalization  procedures  involved  early 
recognition  of  the  need  to  "educate  the  thesaurus"  by  examining  print-outs  when  no 
matches occurred and providing a continuous process of amendment. 5/ Computer-maintained
statistics of word and term usages are closely related to possibilities for

1/ Masterman, Needham, and Sparck-Jones, 1958 [405], p. 934-935; Needham and
Joyce 1958 [305].

2/ Garvin, 1961 [224], p. 138.

3/ Luhn, 1959 [354], p. 12.

4/ De Grolier, 1962 [152], p. 132.

5/ Shepherd, 1963 [545], p. 392.
construction and revision of a mechanized thesaurus, as again Luhn has suggested. 1/
Schultz suggests that machine records should be maintained of what thesaurus terms are
actually used for indexing and searching, the frequencies of term usage, the
co-occurrences, the number of items described by particular combinations of terms and the
like. 2/

The  potential  combinations  of  natural  text  processing,  automatic  indexing,  and 
thesaurus  construction  and  updating  are  stressed  in  many  current  programs.  For 
example,  Eldridge  and  Dennis  discuss: 

"Indexing  by  machine  from  natural  text  in  a  fully  automatic  system,  in  which 
statistical analysis of the words is employed as a device for (a) building
automatically a 'concept' thesaurus, (b) indexing incoming documents with reference
to the thesaurus, and (c) continuously revising the thesaurus to reflect new word
usages in currently incoming documents." 3/

Similarly, Giuliano and Jones suggest that, given a term-term statistical association
matrix, a transformation can be arrived at which, applied to a unit vector assigning value
only to index term Z, ranks every other index term according to degree of association
with Z. Then, by listing the higher ranked terms for each term Z, "a 'thesaurus' listing
can be obtained completely automatically." 4/
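
As an illustration only, the listing procedure described by Giuliano and Jones might be sketched as follows. The term names, the matrix values, the cut-off `top_k`, and the function `thesaurus_listing` are hypothetical and are not taken from the cited report; the sketch assumes that the "transformation" is simply the product of the unit vector with the association matrix.

```python
def thesaurus_listing(terms, assoc, top_k=2):
    """For each term Z, rank all other terms by their association with Z."""
    size = len(terms)
    listing = {}
    for z, term in enumerate(terms):
        # Unit vector assigning value only to index term Z.
        unit = [1.0 if i == z else 0.0 for i in range(size)]
        # Vector-matrix product: association of every term with Z.
        scores = [sum(unit[i] * assoc[i][j] for i in range(size))
                  for j in range(size)]
        ranked = sorted((j for j in range(size) if j != z),
                        key=lambda j: scores[j], reverse=True)
        listing[term] = [terms[j] for j in ranked[:top_k]]
    return listing


# Hypothetical symmetric term-term association matrix.
TERMS = ["film", "thin", "coating", "radar"]
ASSOC = [
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.4, 0.0],
    [0.6, 0.4, 1.0, 0.2],
    [0.1, 0.0, 0.2, 1.0],
]
LISTING = thesaurus_listing(TERMS, ASSOC)
```

Each entry of `LISTING` then plays the part of one thesaurus entry: the terms most strongly associated with the heading term, in rank order.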

6.2 Statistical Association Techniques

A special definition of the word "thesaurus" might, as we have noted, include the
development of devices and techniques which either automatically or by man-machine
interaction serve to suggest the amplification of a set of index terms. We shall briefly
consider here both devices that visually display associations between words, terms, and
documents 5/ and techniques for machine use of coefficients of correlation for prior
co-occurrences in a collection of word-word, word-term, term-term, term-document, and
document-document associations, as in the statistical association factor technique first
developed by Stiles.

1/ Luhn, 1957 [385], p. 316: "Provision should be made to register the number of
times each word is looked up in the index and the number of times each family
number has been used for encoding. Such a record would be an indispensable
part of the system for making periodic adjustments based on the usage of words
or notions as mechanically established."

2/ Schultz, 1962 [529], p. 104.

3/ Eldridge and Dennis, 1962 [183], p. 6.

4/ Giuliano and Jones, 1962 [229], p. 12.

5/ It should be noted that Tabledex, the Scan-Column Index, and similar tools
provide to some extent a display of prior associations between index terms. (See
pp. 25-27 of this report.) Thus Cheydleur (1963 [ Hi], p. 38) remarks: "Ledley. . .
has focussed on inter-item concepts in designing his economical TABLEDEX
arrangement for displaying the connectivity of index terms and related file items."
6.2.1 Devices to Display Associations: EDIAC

The interest aroused among some documentalists by the provocative idea of a "Memex"
to record and display associations between ideas as proposed by Bush in 1945 ([93]) led to
specific attempts at Documentation, Inc. in the 1950's to develop a device which would
incorporate at least the associations between indexing terms assigned to documents and
between documents with respect to their sharing of common indexing terms (1954 [157],
1956 [155, 156]). The first approach to this objective, as reported by Taube, was the idea
of a manual dictionary of terms arranged in alphabetical order, with a "page" reserved for
each and every indexing term used for any document in the collection. On each page would
be listed all other terms that had co-occurred with that term in the indexing of one or more
documents. Another idea was to display associations of terms used in a collection through
the "superimposition of dedicated positions in a set of cards or plates. . . " 1/

Subsequently, an actual device to demonstrate a system for display of term-term,
term-document, and document-document associations was built under an Office of Naval
Research contract. 2/ The demonstration model contained a vocabulary of 250 terms which
had been used in various combinations to index 100 reports. Interconnections in an
electrical network provided the associational linkages. A display panel was provided with
symbol-indicators which could be lighted up to identify particular terms and particular
report numbers.

This EDIAC device (for Electronic Display of Indexing Association and Content) was
intended  for  use  both  in  guiding  an  indexer  to  either  the  extension  or  refinement  of  his 
initial  choice  of  indexing  terms  and  in  assisting  the  searcher.    It  was  claimed  that  the 
operation  of  such  a  device  would  be  extremely  simple.  Thus: 

"For  the  index  question  the  searcher  selects  any  term  in  which  he  is  interested 
and  applies  a  voltage.    He  is  told  instantly  the  number  of  the  reports  dealing  with 
that  subject.    Putting  voltage  in  at  any  term  also  lights  all  other  terms  associated 
with the first term. . . " 3/
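
The linkages that such a device wires together, like the "page" per term of the earlier manual dictionary, amount to two lookup tables. The following is a minimal sketch under that reading; the report numbers, terms, and the function `build_links` are invented for illustration and do not describe the EDIAC hardware itself.

```python
from collections import defaultdict


def build_links(index):
    """index maps a report number to the set of terms used to index it."""
    reports_for = defaultdict(set)  # term -> reports indexed by that term
    terms_with = defaultdict(set)   # term -> terms that co-occurred with it
    for report, terms in index.items():
        for term in terms:
            reports_for[term].add(report)
            terms_with[term] |= terms - {term}
    return reports_for, terms_with


# Hypothetical three-report collection.
INDEX = {
    101: {"radar", "antenna"},
    102: {"radar", "display"},
    103: {"sonar", "display"},
}
REPORTS_FOR, TERMS_WITH = build_links(INDEX)
```

Selecting the term "radar" corresponds to reading `REPORTS_FOR["radar"]` (the report numbers that light up) and `TERMS_WITH["radar"]` (the associated terms that light up).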

A  later  analog  device,    ACORN,    will  be  discussed  below  in  connection  with  the  work 
of  Giuliano  and  associates,  at  Arthur  D.  Little,  Inc. 

6.2.2 Statistical Association Factors - Stiles

The  name  of  H.  Edmund  Stiles,   like  those  of  Luhn,   Baxendale,   Maron,  Swanson, 
Edmundson  and  Wyllys,  is  generally  associated  with  pioneering  innovations  in  those  areas 
of  mechanized  documentation  which  are  directly  related  to  the  use  of  high-speed  computer 
capabilities.    While  Stiles'  work  has  been  directed  primarily  to  problems  of  search 
prescription  formulation  and  renegotiation  based  on  the  results  of  preliminary  search,  he 
has  specifically  recognized  that  the  use  of  statistical  word  association  techniques  in 
searching  operations  can  provide  a  logical  corollary  to  automatic  indexing  procedures. 
Thus:


1/ Taube et al, 1954 [599], p. 102.

2/ It is described and illustrated in Taube et al, 1956 [599], p. 63 ff.

3/ Documentation, Inc. 1956 [156], p. 7.
"Automatic indexing, based on the relative frequency of words used in a document,
produces a partial vocabulary of the content words used to express its subject.
Retrieval can then be accomplished by expanding the request vocabulary. . . This
method tends to overcome the deficiencies and inconsistencies inherent in the use
of terms derived automatically from a text." 1/

Conversely, Stiles also points out the possibility that the results of automatic derivative
indexing procedures, extracting indexing words from the documents directly, might prove
a more realistic or reliable basis for the development of his word co-occurrence
correlation data than do the Uniterms assigned by human indexers. 2/ The work of Stiles has
also stressed the importance of two factors that may well be critical for the improvement of
automatic indexing techniques. These are, namely, the consensus of prior human indexing
and the consensus of subject coverage of a particular collection. 3/

In his experimental investigations, Stiles began with an existing collection of
approximately 100,000 items which had previously been indexed, over a period of time, with a
Uniterm indexing vocabulary consisting of about 15,000 terms. The objective of the
experiments was to determine how, given a specific search request, a more effective "net
to catch documents" 4/ could be generated and how the responding items might be ranked
in order of their probable relevance to the request.

The statistics of co-occurrence of terms used to index the same documents were first
obtained. A modified chi-square formula was then applied to determine relative
frequencies of use of co-occurring terms. 5/ Patterns of term co-occurrence could then be
derived in the sense of term-profiles which show, for each term, the more significant of
its associational values of pairing with other terms in the collection. The actual procedure
for using these term-profiles in search prescription formulation and in document selection
involves several steps, generally as follows: 6/

1/ Stiles, 1962 [573], pp. 12-13.

2/ Stiles, 1961 [572], p. 205.

3/ Stiles, 1962 [573], p. 6 and 1961 [572], pp. 273, 277.

4/ Stiles, 1961 [572], p. 192.

5/ In general, we shall not be concerned with the precise mathematical formulations.
It is to be noted that in a recent report Giuliano and his colleagues have reviewed
a number of the various mathematical formulas proposed in the literature for the
computation of word, term, and document associations, including those of
Parker-Rhodes and Needham, Maron and Kuhns, Stiles, Salton, Osgood, Bennett and
Spiegel (Giuliano et al, 1963 [230], Appendix I).

6/ Stiles, 1961 [571], pp. 273-275.
1. For each term in the initial formulation of a search request, the
appropriate term-profile is obtained, which gives weighted values
for those other terms that had significantly co-occurred with it.

2. The profiles of each term in a multi-term request are compared
and those additional terms common to all or a specified number of
the profiles are selected and added to the initial set. 1/

3. The "first generation" terms resulting from step 2 are next treated
as though they also were request terms, and steps 1 and 2 are repeated
for them.

4. A selection is made from some reasonable proportion of the profiles
associated with the first generation terms to produce the "second
generation" terms. 2/

5.  The  expanded  list  of  search  terms  is  then  compared  with  the  index 
terms  assigned  to  each  document  in  the  collection,  and  whenever  a 
match  is  found  the  weight  of  the  request  term  is  assigned  to  the 
matching  document  term.     These  weights  are  then  summed  to  provide 
a  numeric  measure  of  probable  document  relevance  to  the  original 
request. 

6.  Documents  responding  to  the  expanded  request  are  printed  out  in  the 
order  of  document  relevance  scores. 
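
The six steps above can be sketched roughly as follows, assuming that the term-profiles are already available. This is not Stiles' program: the profile weights, the weight of 4.0 given to the original request terms, the use of the mean profile weight for an added term, and the names `expand` and `rank_documents` are all assumptions made for the illustration.

```python
def expand(request, profiles, min_profiles, exclude=frozenset()):
    """Steps 1-2: keep terms found in at least min_profiles request-term profiles."""
    found = {}
    for term in request:
        for other, weight in profiles.get(term, {}).items():
            if other not in request and other not in exclude:
                found.setdefault(other, []).append(weight)
    # Assumption: an added term is weighted by the mean of its profile weights.
    return {t: sum(ws) / len(ws) for t, ws in found.items()
            if len(ws) >= min_profiles}


def rank_documents(weighted_terms, documents):
    """Steps 5-6: sum the weights of matching terms, then order by score."""
    scores = {doc: sum(weighted_terms.get(t, 0.0) for t in terms)
              for doc, terms in documents.items()}
    ranked = sorted((d for d, s in scores.items() if s > 0),
                    key=lambda d: scores[d], reverse=True)
    return ranked, scores


# Hypothetical term-profiles: term -> {associated term: weight}.
PROFILES = {
    "thin": {"film": 3.0, "deposition": 2.0, "evaporation": 1.0},
    "film": {"thin": 3.0, "deposition": 2.5, "coating": 1.5},
    "deposition": {"thin": 2.0, "film": 2.5, "vacuum": 1.2},
}
REQUEST = {"thin": 4.0, "film": 4.0}  # assumed weights for the request terms
FIRST = expand(set(REQUEST), PROFILES, min_profiles=2)            # steps 1-2
SECOND = expand(set(FIRST), PROFILES, min_profiles=1,
                exclude=set(REQUEST))                             # steps 3-4
WEIGHTED = {**REQUEST, **FIRST, **SECOND}

DOCUMENTS = {
    "D1": {"thin", "film"},
    "D2": {"film", "coating"},
    "D3": {"deposition", "vacuum"},
    "D4": {"radar"},
}
RANKED, SCORES = rank_documents(WEIGHTED, DOCUMENTS)
```

In this toy collection document D3 is retrieved although it is indexed by neither request term, which is the kind of effect the procedure is intended to produce.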

Some  experiments  have  been  made  using  a  computer  program  which  accepts  up 
to  300  weighted  terms  in  an  expanded  request  vocabulary.    Representative  results  have 
been  reported,  in  part,  as  follows: 

".  .  .  We  asked  a  qualified  engineer  to  examine  these  documents  and  specify  which 
were  related  to  'Thin  Films'  and  which  were  not.  .  .  This  engineer  was  not 
familiar  with  our  project.  .  .yet.  .  .we  found  a  remarkably  high  correlation  between 
his  evaluation  and  the  document  relevance  numbers.  .  .  We  then  checked  to  see  how 
the  documents  containing  information  on  'Thin  Film'  had  been  indexed.    We  found 
that  the  first  five  documents  on  our  list  had  been  indexed  by  both  'Thin'  and  'Film'. 
Three  more  documents  had  been  indexed  by  'Film'    alone,  and  other  related  terms. 
Two  documents  had  not  been  indexed  by  either  'Thin'  or  'Film',  but  only  by  a  group 
of  related  terms,  yet  they  contained  information  on  'Thin  Films'  and  had  a  high 
document relevance number. By using association factors and a series of statistical
steps, easily programmed for a computer, we were thus able to locate


1/ These are called "first generation terms" and tend to reflect only statistical
associations without including synonyms and near-synonyms which, over the course of
time, have occurred in the indexing vocabulary.

2/ Stiles, 1961 [571], p. 274: "Among these we find words closely related in meaning
to the request terms." An example given in Ref. [572], pp. 200-201, is the
derivation of 'weathering', 'fungicidal', 'deterioration', and 'preservatives' as
second generation terms when the initial request included the terms 'plastics',
'fungus', 'coating', and 'tests'.


documents  relevant  to  a  request  even  though  the  documents  had  not  been  indexed 
by the terms used in the request." 1/

In another case, which was analyzed in detail, a request profile of 26 terms that
had been intuitively weighted by the customer resulted in the machine listing of 246
presumably responsive documents. Of these, 81 documents were of primary interest
to the customer, and an additional 78 were of secondary interest to him. 2/
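
The proportions implied by these figures can be worked out directly. The arithmetic below uses only the counts quoted above and adds no further data.

```latex
\[
\frac{81}{246} \approx 0.33, \qquad
\frac{81 + 78}{246} = \frac{159}{246} \approx 0.65, \qquad
246 - 159 = 87 .
\]
```

That is, about one third of the listed documents were of primary interest, about two thirds were of primary or secondary interest, and the remaining 87 were of neither kind.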

The statistical association technique as proposed by Stiles has also been
investigated at the Datatrol Corporation, with particular reference to the field of legal
literature (Hammond et al, 1962 [251]). About 350 documents in the field of Federal
public law were indexed in cooperation with George Washington University, using a
vocabulary of 680 index terms. A computer program was written for the IBM 7090 that
can accommodate a 1200 x 1200 matrix to calculate the Stiles association factors. Trials
were made of various thresholds to determine which other terms were sufficiently high in
association strength to a particular term to be selected for that term's profile.
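
A rough sketch of such a computation follows. This report does not give the formula; the sketch assumes the chi-square-type factor with a continuity correction, in the form commonly attributed to Stiles, and further assumes that only positive associations are wanted. The ten-document collection, the threshold of 0.0, and the function names are hypothetical.

```python
import math
from itertools import combinations


def association_factor(f, a, b, n):
    """f = documents indexed by both terms, a and b = documents indexed by each, n = all."""
    denom = a * b * (n - a) * (n - b)
    core = abs(f * n - a * b) - n / 2
    # Assumption: pairs co-occurring no more often than chance get no factor.
    if denom == 0 or core <= 0 or f * n <= a * b:
        return float("-inf")
    return math.log10(core * core * n / denom)


def build_profiles(index, threshold):
    """index maps a document to its set of index terms; returns term-profiles."""
    n = len(index)
    count, joint = {}, {}
    for terms in index.values():
        for t in terms:
            count[t] = count.get(t, 0) + 1
        for pair in combinations(sorted(terms), 2):
            joint[pair] = joint.get(pair, 0) + 1
    profiles = {t: {} for t in count}
    for (s, t), f in joint.items():
        factor = association_factor(f, count[s], count[t], n)
        if factor >= threshold:  # keep only sufficiently strong associations
            profiles[s][t] = factor
            profiles[t][s] = factor
    return profiles


# Hypothetical ten-document legal collection.
INDEX = {
    1: {"contract", "damages"}, 2: {"contract", "damages"},
    3: {"contract", "damages"}, 4: {"contract", "damages"},
    5: {"contract", "tax"}, 6: {"tax"}, 7: {"tax"},
    8: {"tax"}, 9: {"tax"}, 10: {"appeal"},
}
TERM_PROFILES = build_profiles(INDEX, threshold=0.0)
```

Raising or lowering `threshold` corresponds to the trials described above: a higher value admits fewer terms to each profile.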

Given  the  generation  of  the  term  profiles,  a  less  sophisticated  computer  such  as 
the  1401  can  be  used  for  the  expansion  of  request  terms  and  the  actual  conduct  of  searches. 
Such  a  program  was  demonstrated  at  the  Annual  Meeting  of  the  American  Bar  Association, 
August  1962,  with  running  of  "live"  requests  suggested  by  jurists  and  with  what  are 
claimed  to  be  "highly  gratifying  results".    A  point  of  interest  relates  to  the  question  of 
updating  of  term-profiles  and  other  statistical  association  factor  data.    Hammond,  et  al 
report: 

"The  term  profiles  were  generated  a  total  of  three  times  in  the  course  of  the 
pilot study, making it possible, to some extent, to assess the effect of
vocabulary growth. Judging from this limited experience, it appears that a
bimonthly, or perhaps even quarterly, recompilation of term profiles should be
sufficient for a mature collection." 3/

6.2.3 The Association Map - Doyle and Related Work at SDC

The  name  of  Doyle  is  again  that  of  an  early  and  prolific  investigator  and  innovator 
in  the  field  of  mechanized  documentation  and  linguistic  data  processing.    One  of  his 
provocative    suggestions  is  generally  known,  in  his  own  terminology,  as  that  of  "semantic 
road  maps  for  literature  searchers"  or  an  "association  map"  technique.    As  a  matter  of 
convenience,  we  have  chosen  to  consider  this  suggestion  and  a  variety  of  related  work 


1/ Stiles, 1961 [577], pp. 198-199.

2/ Stiles, 1962 [573], p. 9.

3/ Hammond et al, 1962 [251], p. 6.
under the general heading of the association map technique, 1/ although passing reference
has been made to some of Doyle's suggestions and findings elsewhere in this report.

Beginning in 1958 (Doyle, 1959 [168]) information retrieval projects at the System
Development Corporation have had, among other objectives, that of developing ways to
use computers in the processing and interpretation of natural language text. By February
of 1959, a computer program was already in operation that could search fragments of
about 100 words of keypunched text, match input words against a pre-established clue word
selection list (i.e., an inclusion dictionary) and substitute a short encoded form to be
used for subsequent search. Processing of keypunched abstracts using this program
involved computer time at the rate of four abstracts per second.
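
The matching and substitution step just described can be pictured with a short sketch. The clue words, the short codes, and the function `encode_text` are invented for the illustration; the actual SDC encoding is not specified here.

```python
import re


def encode_text(text, clue_codes):
    """Keep only words on the inclusion list, replaced by their short codes."""
    words = re.findall(r"[a-z]+", text.lower())
    return [clue_codes[w] for w in words if w in clue_codes]


# Hypothetical inclusion dictionary: clue word -> short encoded form.
CLUE_CODES = {"radar": "R01", "output": "O07", "display": "D03"}
ENCODED = encode_text(
    "Radar output is sent to the display; output is logged.", CLUE_CODES)
```

Words not on the inclusion list are simply dropped, so that subsequent searching operates on the short codes alone.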

Other  features  of  this  text  compiler,  and  of  subsequent  text  processing  programs 
developed  at  SDC,   enable  the  making  of  frequency  counts  and  other  statistical  measures. 
Such  features  are  then  used  for  the  investigation  of,  for  example,  word-word,  word- 
document,  and  word-subject  associations,  looking  toward  the  determination  of  answers  to 
such questions as: "Do subject words have distribution characteristics within a library
that a computer program can detect?" 2/

Doyle's investigations of word co-occurrences have included hypotheses and tests
of various probabilistic measures in terms of observed frequencies, in terms of "boing!"
words (so-called because of the mental sound effect they elicit), 3/ in terms of adjacent
word pairs and affinities between particular nouns and particular adjectives, 4/ and in
terms of distinctions between frequency (the total number of times a word appears in a
given library corpus) and prevalence (the total number of items in which a particular word
appears). 5/ He has also stressed distinctions between correlations of adjacent words and
high correlations for words that are not closely positioned together in text, as follows:


1/ Compare Doyle himself, 1962, [163], p. 383: "Swanson and others have offered
thesauri of synonyms and related terms. . . (to assist in indexing or search
processes). . . An association map is, in a sense, an extension of this solution; it is
a gigantic, automatically derived thesaurus. Confronted by such a map, the
searcher has a much better 'association network' than the one existing in his mind,
because it corresponds to words actually found in the library, and, therefore, words
which are best suited to retrieve information from that library." See also Wyllys,
1962 [651], p. 16: "L. B. Doyle (1961) has invented a fascinating search tool which
seems to us to belong at a level intermediate between automatic indexes and
automatic abstracts; i.e., a possible search method might be to have the computer scan
automatic indexes and compare the index terms therein with the request, then
obtain the possibly pertinent documents and display their association map for the
user to examine. . . "

2/ Doyle, 1959 [168], p. 6.

3/ Doyle, 1959 [165], p. 5.

4/ Doyle, 1961 [169], p. 12; 1959 [165], p. 16.

5/ Doyle, 1962 [163], p. 380.
"We have also perceived that two different cognitive processes seem to be
responsible for each type of correlation, one (adjacent correlation) involving
the habitual use of word groups as semantic units, and the other (proximal
correlation) having to do with the patterns of reference to various aspects of
that which is being discussed. We can call the statistical effects, respectively,
'language redundancy', and 'reality redundancy'. Such a resolution of statistical
effects is full of significance for information retrieval because it appears likely
that reality redundancy can vary greatly from one science to another, whereas
language redundancy, a universal property of talking and writing, is relatively
invariant." 1/
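
The distinctions drawn by Doyle between frequency and prevalence, and between adjacent and proximal co-occurrence, can be made concrete with simple counts. The sketch below is an illustration only: the three items are invented, and "proximal" is approximated here by co-occurrence anywhere within the same item, which is an assumption rather than Doyle's own measure.

```python
from collections import Counter


def frequency_and_prevalence(items):
    """Frequency counts every occurrence; prevalence counts items containing the word."""
    frequency, prevalence = Counter(), Counter()
    for words in items:
        frequency.update(words)
        prevalence.update(set(words))
    return frequency, prevalence


def adjacent_and_proximal(items):
    """Adjacent pairs are neighbouring words; proximal pairs share an item."""
    adjacent, proximal = Counter(), Counter()
    for words in items:
        adjacent.update(zip(words, words[1:]))
        unique = sorted(set(words))
        proximal.update((a, b) for i, a in enumerate(unique)
                        for b in unique[i + 1:])
    return adjacent, proximal


# Hypothetical items, each already reduced to a list of topical words.
ITEMS = [
    ["air", "defense", "system", "output", "air", "defense"],
    ["radar", "output", "display"],
    ["air", "traffic", "radar"],
]
FREQUENCY, PREVALENCE = frequency_and_prevalence(ITEMS)
ADJACENT, PROXIMAL = adjacent_and_proximal(ITEMS)
```

Here "air" has a frequency of 3 but a prevalence of 2, and the pair "air defense" appears as an adjacent pair (language redundancy), whereas "output" and "radar" co-occur only at the level of the item.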

With respect to the "semantic roadmap" or "association map" technique itself,
Doyle's suggestion is that various measures of word and index term cross-associations
may be applied to the generation of graphic displays of both types of co-occurrence
relationships. Because of the variety of, in particular, the "proximal" correlations, it
is assumed that the literature searcher should be given a display in which the
representation of the assemblage of the varied relationships is two-dimensional rather than
one-dimensional. 2/ An example is given of associational connections for the word
"output", based upon computer processing of 600 abstracts of SDC internal reports to find
intersections between 500 topical words. This was generated by selecting the eight words
most strongly correlated in the data with "output", such as "manual" and "radar", and then
finding three other words highly correlated with each of these and also correlated with
"output" itself. From the initial graph, it is further shown that item surrogates might
be generated by word selection rules applied to documents to pick up, for example,
"New York Air Defense    system    data    outputs    D.C." 3/

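The selection rule used to generate the "output" example can be sketched as follows. The correlation values are invented, the ring sizes are reduced from eight and three to two and one to keep the example small, and the function `association_map` is hypothetical; whether Doyle's program excluded first-ring words from the second ring is not stated, and the sketch does not exclude them.

```python
def association_map(center, corr, n_first=8, n_second=3):
    """corr[a][b] holds a symmetric correlation between words a and b."""
    def strongest(word, candidates, k):
        ranked = sorted(candidates, key=lambda w: corr[word].get(w, 0.0),
                        reverse=True)
        return [w for w in ranked[:k] if corr[word].get(w, 0.0) > 0]

    others = [w for w in corr if w != center]
    first_ring = strongest(center, others, n_first)
    graph = {}
    for word in first_ring:
        # Candidates must also be correlated with the central word itself.
        linked = [x for x in others
                  if x != word and corr[center].get(x, 0.0) > 0]
        graph[word] = strongest(word, linked, n_second)
    return graph


# Hypothetical correlations among five topical words.
CORR = {
    "output": {"manual": 0.8, "radar": 0.7, "input": 0.3, "display": 0.2},
    "manual": {"output": 0.8, "input": 0.6, "radar": 0.1},
    "radar": {"output": 0.7, "display": 0.9, "manual": 0.1},
    "input": {"output": 0.3, "manual": 0.6},
    "display": {"output": 0.2, "radar": 0.9},
}
GRAPH = association_map("output", CORR, n_first=2, n_second=1)
```

The resulting dictionary lists, for each first-ring word, the second-ring words to be drawn next to it on the map.
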
Continuing related work by Doyle and others at SDC has included various
experimental studies of "pseudo-documents" consisting of lists of the twelve most frequently
occurring words in 100-item samples of abstracts in various subject fields (Doyle, 1961
[161]). Of special interest in terms of potential improvements and modifications to
machine indexing techniques are studies, based on similar lists, looking to the
separation of words that may have been used in several different senses, i.e., the detection of
homographs by statistical means (Doyle, 1963 [171]). More recent investigations by
Doyle involve considerations of differences between word-grouping and
document-grouping techniques and of possibilities for use of hybrid methods.

6.2.4 Work of Giuliano and Associates, the ACORN Devices

A program directed toward the design of "an English command and control language
system" under an Air Force contract with Arthur D. Little, Inc., involves several
interrelated aspects of natural language text processing, use of statistical association factors
in search, man-machine interaction during search, and display of associational
relationships by means of analog network devices. In this program and in related research,
Giuliano and his associates are convinced that:


1/ Doyle, 1961 [169], p. 15.

2/ Doyle, 1962 [163], p. 379.

3/ Doyle, 1961 [169], pp. 24-25.
"Automatic index term association techniques are needed to improve the recall
of relevant information, to enable indexers and requestors to use language in a
more natural manner, and to enable retrieval of relevant messages which are
described by different index terms than those used in the inquiry." 1/

For  the  most  part,  the  work  to  date  has  been  directed  to  "associative  retrieval"  of 
messages limited to single sentences of English text, and to the search phases of a
proposed system.

In  the  case  of  a  corpus  consisting  of  230  sentences  from  a  single  text,  a  partially 
automatic  indexing  method  was  used.     The  text  was  first  processed  against  a  modified 
version  of  the  Harvard  Multipath  Syntactic  Analysis  computer  program  and  the  resulting 
analyses were manually screened to select a unique, correct analysis for each sentence.
Next,  approximately  500  words,  those  that  had  been  marked  "noun"  by  the  syntactic 
analyzer,  were  listed  out  and  these  in  turn  were  manually  screened  to  provide  an 
"inclusion  list"  of  273  words.    Sentences  were  then  "indexed"  with  respect  to  which  of 
these  selected  words  they  contained.    Word  associations  were  computed  both  in  terms  of 
co-occurrence  within  a  sentence  and  of  co-occurrence  in  syntactic  structures. 
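
The sentence "indexing" and the simpler of the two association counts (co-occurrence within a sentence) can be sketched as follows. The sentences and the inclusion list are invented, the syntactic analysis and manual screening are not reproduced, and the function names are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations


def index_sentences(sentences, inclusion_list):
    """Index each sentence by the inclusion-list words it contains."""
    return [sorted(set(re.findall(r"[a-z]+", s.lower())) & inclusion_list)
            for s in sentences]


def sentence_cooccurrence(indexed):
    """Count how often each pair of selected words shares a sentence."""
    pairs = Counter()
    for words in indexed:
        pairs.update(combinations(words, 2))
    return pairs


# Hypothetical corpus and manually screened inclusion list.
SENTENCES = [
    "The pilot reported the radar fault.",
    "Radar and radio were checked by the pilot.",
    "The radio was replaced.",
]
INCLUSION = {"pilot", "radar", "radio", "fault"}
INDEXED = index_sentences(SENTENCES, INCLUSION)
PAIRS = sentence_cooccurrence(INDEXED)
```

Counts of this kind would then be normalized into connection strengths; co-occurrence within syntactic structures would require the parser output and is not shown.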

Retrieval tests were then applied using both computer programs and the analog
device, and evaluations were made on the basis of examining sentences selected in order
of machine-ranked relevance and of comparisons of word lists associated with a given
search term against association lists for another term picked at random. It is noted that,
"although quantitative conclusions cannot be drawn", the results support the conclusion
that: "Items retrieved due to automatically-generated associations tend to be more
relevant than is explainable on a chance basis." 2/

The "request reformulation" retrieval program has also been used to generate term
profiles from a collection of approximately 10,000 documents (previously indexed with at
least 6 terms from a selective term vocabulary of 1,000 terms), and these profiles have
then been compared against the lists provided in the entries for corresponding terms in
the Thesaurus of ASTIA Descriptors, Second Edition. The machine-produced association
lists, at least for those words occurring relatively frequently in the corpus, appear to
give thesaurus entries that are extensive, specific, intuitively acceptable, and of high
quality, especially with respect to listings of synonyms as well as factually related
words. 3/

The development of the ACORN (Associative Content Retrieval Network) devices
has provided additional tools for testing and display (1962 [229], 1963 [227, 304]).
These  devices  are  networks  of  passive  resistance  elements.    Each  word  or  index  term 
and  each  sentence  (240  by  230  in  ACORN-IV)  are  represented  by  terminals  interconnected 
by  resistors  with  conductance  equal  to  the  connection  strength,  and  with  "leak"  resistors 


1/ Giuliano 1962 [228], p. 10.

2/ Giuliano et al, 1963 [230], p. 47.

3/ Ibid, pp. 57-58.
providing for various normalizations that may be applied to compensate for word or
sentence  frequency  factors.    These  devices  differ  from  the  earlier  EDIAC  in  the 
variable  weightings  provided,  in  the  normalizations  that  may  be  applied,  and  in  multipath 
interconnections. 

When, for example, currents are applied at some of the word terminals, the
voltages appearing on any of the other word terminals depend on the strengths of association
between these words and the input words via all direct and indirect paths. The responses
of sentence terminals to the input words of a query similarly depend upon how strongly a
sentence is connected to these words and how strongly it is connected to other words
which in turn are strongly connected to the query words. It is to be noted further that:

"Pulling  out  or  cutting  a  few  randomly  selected  wires  in  an  ACORN  generally 
has a surprisingly small effect. . . This insensitivity is of course, explainable
in terms of the multiplicity of indirect and redundant association paths which
remain intact when a direct path is severed. . . It. . . suggests that the retrieval
process can indeed be made insensitive to minor variations in indexing." 1/
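
The behavior of such a passive network can be imitated numerically by solving the nodal equations of the circuit. The sketch below is a software analogy, not a description of the ACORN hardware: the four-connection network, the uniform leak conductance of 0.5, the unit input current, and the function names are all assumed for the illustration, and the equations are solved by simple Gauss-Seidel relaxation.

```python
def connect(edges):
    """Build a symmetric conductance table from (node, node, strength) triples."""
    conductance = {}
    for a, b, strength in edges:
        conductance.setdefault(a, {})[b] = strength
        conductance.setdefault(b, {})[a] = strength
    return conductance


def node_voltages(conductance, leak, injected, sweeps=500):
    """Solve Kirchhoff's current law at every terminal by repeated relaxation."""
    voltage = {node: 0.0 for node in leak}
    for _ in range(sweeps):
        for node in leak:
            neighbours = conductance.get(node, {})
            total = leak[node] + sum(neighbours.values())
            inflow = injected.get(node, 0.0) + sum(
                g * voltage[m] for m, g in neighbours.items())
            voltage[node] = inflow / total
    return voltage


# Hypothetical network: word terminals linked to the sentences containing them.
EDGES = [("jet", "S1", 1.0), ("engine", "S1", 1.0),
         ("engine", "S2", 1.0), ("radar", "S3", 1.0)]
NODES = ["jet", "engine", "radar", "S1", "S2", "S3"]
LEAK = {node: 0.5 for node in NODES}  # assumed "leak" conductance to ground
VOLTAGE = node_voltages(connect(EDGES), LEAK, injected={"jet": 1.0})
```

With current applied at the "jet" terminal, sentence S1 responds most strongly, S2 responds more weakly through the indirect path by way of "engine", and S3, which has no path to the query word, does not respond at all.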

In addition, there are intriguing possibilities for imposing a "viewpoint" with
respect to a search by injecting bias currents. Thus if only non-"Air Force" jet
plane items were desired, the "Air Force" items could in effect be grounded out. If there
were no jet items in the collection other than those which were also Air Force items,
these would still be indicated as responsive, but for the most part they would appear only
in that case. Some words have some connection to almost all other words, but these
have little effect in the system, and the hardware thus tends to compensate for the high
frequencies of very general words.

6.2.5 Spiegel and Others at Mitre Corporation

Bennett and Spiegel, reporting at the Symposium on Optimum Routing in Large
Networks, IFIP Congress 1962, 2/ consider modifications to formulas for the calculation
of  statistical  association  factors  which  will  normalize  against  such  influences  as  frequency 
of  word  occurrences,  relative  word  position  within  a  string  of  words,  and  string  length. 
This  work  has  been  carried  forward  at  the  Mitre  Corporation  in  a  program  for  developing 
procedures  to  encode  various  statistical  properties  of  messages  or  documents  and  to  use 
these  codes  for  message  routing  and  retrieval. 

Differences  between  this  approach  and  those  of  Maron  and  Kuhns,  Stiles,  and  Doyle, 
relate  primarily  to  the  questions  of  how  best  to  normalize.    The  objective  is  closely 
similar:    to  use  associational  weighting  so  as  to  provide,  in  response  to  a  query,  output  of 
documents  or  messages  ranked  in  order  of  probable  relevance  to  the  query. 


1/ Giuliano and Jones, 1962 [229], p. 22.

2/ See Juncosa, 1962 [306], especially paper 4, E. Bennett and J. Spiegel,
"Document and message routing through communication content analysis",
pp. 718-719.
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Additional  features  include  provision  for  the  matrix  of  coefficients  of  association 
to  change  with  time  or  with  deliberate  manipulation  to  improve  performance.  Thus: 

"Each  normalized  cell  weight.  .  .  rises  and  falls  with  time  as  each  specific 
association  increases  or  decreases  in  relative  frequency.    In  this  way,  the 
matrix  memory  of  associations  changes  with  time,  maintaining  a  cumulative 
pattern  of  associations  reflecting  one  statistical  characteristic  of  messages 
fed  into  it  in  the  past.  .  . 

"In  addition  to  this  adaptive  characteristic  of  changing  memory  with  time  and  with 
changing  inputs,  the  matrix  is  also  readily  subject  to  formal  education.  Any 
specific  cell  weight  can  be  strengthened  by  repeatedly  reading  into  the  matrix 
memory  the  specific  strings  that  contain  the  desired  associations.    For  example, 
by  introducing  the  strings  is  am,  is  are,  am  is,  am  are,  and  are  am,  we  can 
increase  the  statistical  tendency  of  the  tokens  is,  am,  and  are  to  be  associated.  "  — 
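The adaptive behavior described in this quotation can be illustrated with a minimal sketch. The source does not give the Mitre normalization formula, so the weight used here (joint count divided by the geometric mean of the two token counts) is an assumption chosen only for illustration, and the class and method names are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


class AssociationMatrix:
    """Toy co-occurrence memory; the normalization is an assumed stand-in."""

    def __init__(self):
        self.token_counts = Counter()  # strings in which each token has appeared
        self.pair_counts = Counter()   # strings in which each unordered pair has appeared

    def read(self, string):
        """Read one string (message or descriptor string) into the matrix memory."""
        tokens = sorted(set(string.lower().split()))
        self.token_counts.update(tokens)
        self.pair_counts.update(combinations(tokens, 2))

    def weight(self, a, b):
        """Normalized cell weight: joint count over the geometric mean of token counts."""
        joint = self.pair_counts[tuple(sorted((a, b)))]
        if joint == 0:
            return 0.0
        return joint / sqrt(self.token_counts[a] * self.token_counts[b])


memory = AssociationMatrix()
for s in ["is here", "am here", "are there"]:
    memory.read(s)
before = memory.weight("is", "am")

# "Formal education": repeatedly read in strings containing the desired associations.
for s in ["is am", "is are", "am is", "am are", "are am"]:
    memory.read(s)
after = memory.weight("is", "am")
```

Because every weight is recomputed from cumulative counts, a cell rises when its pair becomes relatively more frequent and falls when the individual tokens keep appearing without each other, which is the qualitative behavior the investigators describe.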

Experimental  results  have  been  obtained  for  a  corpus  of  500  bibliographic  entries 
contained  in  DDC's  Title  Announcement  Bulletin.    In  the  case  of  a  three-term  query,  40 
items  were  selected  and  ranked  in  probable  relevance  order,  with  selection  based  on  a 
particular  relevance  score  value  threshold.    The  investigators  then  reviewed  the  abstracts 
of  all  500  items  and  rated  them  as  to  relevance  with  respect  to  the  query.  Seven 
additional  items  were  found,  of  which  three  would  have  been  machine-selected  with  a 
less stringent selection threshold. For the remaining four, it is reported that they "were
poorly indexed and could have been judged not relevant by a human who depended upon the
descriptor string only, as the matrix did, rather than upon review of the abstracts." 2/

1/ Spiegel et al, 1963 [566], p. 17.

2/ Ibid, p. 34.

6.3 Clues to Index-Term Selection from Automatic Syntactic Analysis

Several of the organizations and research teams most active in the investigation
of linguistic data processing techniques, especially for automatic indexing, extracting,
and search renegotiation applications, are actively considering the use of clues derived
from automatic syntactic analysis to improve criteria for machine selection of
"significant" words, phrases, and sentences from raw text. Such approaches, in general,
however, are limited, in the first place, by the non-availability of sufficient corpora of
text in machine-usable form and, even more importantly, by the non-availability of
satisfactory computer programs for complete syntactic analysis up to the present
time. 1/ In terms of the state-of-the-art of automatic indexing, therefore, we
shall not consider these approaches as more than indications for future research. A few
suggestive examples are discussed briefly below.

The multi-pronged attack on mechanized information selection and retrieval
problems headed by Salton and his associates includes the exploration of tree structures,
to represent both the relationships between terms in a classification schedule or indexing
term vocabulary and the representation of the results of automatic syntactic analyses of
natural language text. It is proposed, then, that computer programs can achieve
transformations of the syntactic trees representing word strings in the original text into
simplified, condensed structures with normalized terms and can compare these trees
with the classificatory trees (Salton, 1961 [516]). Manipulation of such trees together
with appropriate dictionaries or thesauri can result, for a given proposed index term, in
the finding of a preferred term for a particular system, or a set of synonymous terms, or
sets of all terms in which the given term is included, and the like.

Anger  considers  some  of  the  problems  involved  in  complete  syntactic  analysis  of 
texts  with  the  objective  of  identifying  the  total  network  of  relationships  expressed  and 
implied, as proposed by Lecerf, Ruvinschii, and Leroy, among others, of the Research
Group on Automated Scientific Information (GRISA), EURATOM. Assuming that computer
programs for syntactic analysis are or will be available, he suggests that simplifications
may be obtained by determining only the basic relations that are indicated by direct
syntactic dependencies or by linking words (Anger, 1961 [15]).

A  specific  program  for  automatically  extracting  syntactic  information  from  text  has 
been studied by Lemmon (1962 [354]). The possibilities for combining dictionary lookups,
word suffixes as indicators of syntactic role, and predictive syntactic analysis for text
processing have also been further explored by Salton himself (1962 [518], 1963 [519]).
A  variety  of  word  and  document  association  techniques  and  of  synonymous  word  and 
phrase  groupings  which  serve  to  "clue"  the  selection  of  a  subject  heading  are  also  being 
investigated  by  members  of  the  Harvard  group  and  guest  investigators. 

1/ Major difficulties have to do with limitations both upon grammars and vocabularies
so far tested and with ambiguities and the number of alternative parsings generated.
See, for example, Bobrow, 1963 [68], Kuno and Oettinger, 1963 [341], and
Robinson, 1964 [502]. Bobrow provides a survey of syntactic analysis programs
as of 1963, noting limitations or restrictions on each. He reports, for example,
that available programs to compute word classes are not always correct in the
class assignments made and that analysis systems are not complete unless they
provide means for distinguishing between "meaningless strings and grammatical
sentences whose meaning can be understood". He concludes: "Until a method of
syntactic analysis provides, for example a means of mechanizing translation of
natural language, processing of a natural language input to answer questions, or a
means of generating some truly coherent discourse, the relative merit of each
grammar will remain moot." ([68], p. 385) Robinson ([502], p. 12) says of
sentences which can be parsed correctly, that they are: "Usually short sentences
with no complicated embeddings of relative clauses and few participial or
prepositional phrase modifiers. These include the basic sentences that most
grammars are equipped to handle and that adult writers seldom produce."

Another partial approach to applying syntactic analysis techniques to automatic
indexing is based upon syntactic word-class recognitions. Giuliano and his associates
at Arthur D. Little, Inc. (1963 [230]), have investigated on a small-scale basis the
use of the Kuno-Oettinger programs developed at Harvard for this purpose (Kuno and
Oettinger, 1963 [340]). The broad program of information and language data processing
research at System Development Corporation specifically includes investigations of
structural patterns of sentences at the syntactic level and also of semantic factors
such as the studies of polysemy and homographic ambiguity by Doyle, Wasser, and
others. Borko reports:

"... We ... are analyzing actual written text for multiple meanings ... The data
for this study were drawn from the corpus of 618 psychological abstracts.
Tabulations of frequency of paired and single word listings were used. A
number of corpus-derived word frames have been prepared. Although this
research is still in its early phase, we feel that we have made a good start
on the problems of semantic analysis." 1/

In Czechoslovakia, at the Karlova Universita, both statistical and semantical methods for
automatic abstracting are reported as being under consideration. 2/

Other examples of proposals for the use of syntactic analysis techniques for the
improvement of automatic indexing products include those of Spangler, Levery, Plath,
Thorne, and Climenson and his colleagues at RCA, as well as the suggestions of those
whose interests in automatic syntactic analysis have been primarily directed to problems
of machine translation or more general problems of linguistic analysis. Hays, for
example, although principally concerned with MT, indicates that the methods for
determining phrase structures have obvious applications to the automatic determination
of categories useful in the indexing of documents. 3/

An existing GE-225 computer program for KWIC-type indexing from both titles and
abstracts at General Electric's Phoenix Laboratories is being extended to incorporate
word analysis features taking into account both syntactic and semantic aspects of a given
line or sentence of text. 4/ Levery provides an example of similar directions being
explored in European research, more generally oriented toward linguistic considerations
as such than to machine-derivable criteria (largely statistical to date), which seek to
combine the benefits of both human and machine processes by way of automatic syntactic
analyses. He claims, for example, that:

1/ Borko, 1962 [75], p. 6.

2/ National Science Foundation's CR&D report No. 11 [430], p. 123.

3/ See Hays, 1961 [258], p. 13: "... Two broad problems on which work is just
beginning at RAND: grammatic transformations and distributional semantics. The
latter problems are especially important for automatic indexing, abstracting, and
text searching." See also de Grolier, 1962 [152], p. 137.

4/ National Science Foundation's CR&D report No. 11 [430], p. 21.

"... The study of the position of keywords in the text and the syntactical
relationship which exists among them will show the way to automatic
abstracting and the use of more sophisticated retrieval systems." 1/

Plath  suggests  that,  given  a  computer  program  to  perform  the  parsing  and 
syntactic  diagramming  of  a  text  sentence,  the  results  can  serve  quite  usefully  to  augment 
the  selection  criteria  based  initially  on  statistical  techniques,   such  as  word-frequency 
counting. He says, for example:

"Another possible application of the outputs of the sentence diagramming program
is their employment as an aid in language data processing for purposes of
information retrieval, particularly in systems for automatic literature abstracting
of the sort proposed by Luhn (1958). The feature of the tree diagrams which is
pertinent here is that the main components of a clause, including subject, verb
and object, always correspond to the 'main topics' in an outline, and are therefore
located at the upper levels of the tree. When the words on these upper levels are
considered apart from the lower-level structures which modify them, they often
summarize the content of the sentence in a sort of 'newspaper headline' or
'telegraphic style'." 2/
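Plath's observation can be made concrete with a small sketch. It assumes that a parse is already available as a tree of (position, word, children) nodes; the tree below is built by hand for a made-up sentence, and the function name and representation are hypothetical rather than taken from the sentence diagramming program itself.

```python
def headline(node, max_depth, depth=0):
    """Collect (position, word) pairs for every node no deeper than max_depth."""
    position, word, children = node
    kept = [(position, word)]
    if depth < max_depth:
        for child in children:
            kept.extend(headline(child, max_depth, depth + 1))
    return kept


# Hand-built tree for: "The old program selects significant words from raw text"
# The verb is the root; subject and object hang directly beneath it.
tree = (3, "selects", [
    (2, "program", [(0, "The", []), (1, "old", [])]),
    (5, "words", [
        (4, "significant", []),
        (6, "from", [(8, "text", [(7, "raw", [])])]),
    ]),
])

# Keep only the root and its immediate dependents, restored to text order.
summary = " ".join(word for _, word in sorted(headline(tree, 1)))
```

Raising `max_depth` admits progressively more of the modifying structure, so the cut-off depth acts as the selection parameter.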

The problems of multi-level selection, or screening, such that machine programs
for selection of the most probably significant words, phrases, or sentences can be
focussed upon the most probably content-revelatory areas of text, are treated here, as
also by Salton, in the sense of a cutting-off at a given depth in the analyzed syntactic
structure. 3/ A potentially important contribution to the future prospects for automatic
indexing, however, lies in the "discourse analysis" and "transformational linguistics"
approach of Harris (1959 [254]), where condensations and concentrations of similarities
and differences of topical interest may hopefully be achieved.

Harris  himself  suggested,  at  least  as  early  as  1958,  applications  of  his  approach  to 
both  automatic  indexing  and  abstracting.    A  goal  of  the  analyses  he  has  proposed  is  to 
identify  'kernels'  of  linguistic  expression,  having  first,  by  various  transformations  such 
as  from  passive  to  active  voice,  brought  together  different  ways  of  saying  the  same  thing. 
He then suggests not only machine operations to normalize by application of his
transformational rules but also to determine:

"... Which kernels have the same centers in different relations (e.g., with
different adjuncts), and other characterizing conditions. The results of this
comparison would indicate whether a kernel is to be rejected or transformed
into a section ... of an adjoining kernel, or stored, and whether it is to be
indexed, and perhaps whether it is to be included in the abstract." 4/

1/ Levery, 1963 [359], p. 236.

2/ Plath, 1962 [474], pp. 189-190.

3/ See also Thorne, 1962 [bOS], p. v: "The approach followed requires that the
computer itself syntactically analyse input text in order to convert it into special form
called FLEX, which preserves only that syntactic information which is useful for
data retrieval purposes."

Certain difficulties are self-evident. Consider, for example, the admittedly hypothetical
text which might refer in various places to the "dissolute, disreputable, illiterate, elder
Lincoln" (underlining supplied) and which might be so processed by machine as to imply
that Lincoln the son was, although also President of the United States, "dissolute,"
"disreputable," "illiterate," and "elder." These, however, are difficulties that plague
almost any machine processing of natural language text.

Climenson, Hardwick, and Jacobson have explored some of the possibilities of the
Harris approach in experimental computer programs for the RCA 501 (1961 [133]).
Specific features of these programs include:

1.  Establishment  of  the  syntactic  class  or  classes  to  which  a  given  word  can 
belong,  by  dictionary  lookup. 

2.  Investigations of sentence structure and context in an attempt to resolve the
homographic ambiguities involved when the same word may function either
as a noun or a verb.

3.  Isolation and marking of sentence segments, such as noun phrases,
prepositional phrases, adverbial phrases, and verb phrases.

4.  Identification  and  marking  of  segments  --  clauses  or  degenerate  clauses. 

On a very preliminary basis, a limited set of word and phrase deletion rules was
set up and several sample documents were processed against them, yielding reductions
to about 35 percent of the original text. These results suggest that "syntactical filtering
criteria" might be applied to the improvement of modified derivative indexing techniques,
such as the word-frequency counting techniques, either by deleting syntactically
insignificant parts of selected sentences, or by counting identical phrases rather than
words. The investigators conclude, however, that:

"A formal linguistic approach to the problems of natural language processing
promises to yield results vital to the success of automatic indexing and data
extraction. But the work required in such an approach will be quite arduous;
a long-range man-machine effort will be required to formulate practical
machine programs for indexing and abstracting." 1/

A  final  special  case  of  linguistic  data  processing  involving  syntactic  analysis  is 
that of Langevin and Owens. They claim:

"A critical review of the analysis work done on the Nuclear Test Ban Treaty
by use of the Multiple Path Syntactic Analyzer demonstrates that such a device
can, even at present, provide a powerful technique for the systematic discovery
of ambiguities in treaties and other documents. Because the analyzer operates
without bias from the overall context of the document, it may sometimes be
possible for it to discover ambiguities that would easily escape a human reviewer
who knows what the document is 'supposed to say'." 2/

1/ Climenson et al, 1961 [133], p. 182.

2/ Langevin and Owens, 1963 [346], p. 26.

6.4 Probabilistic Indexing and Natural Language Text Searching

As  in  the  case  of  automatic  indexing  proposals  based  upon  automatic  sentence 
extraction  techniques,  machine  searching  of  full  natural  language  text  has  been  suggested 
as  a  basis  for,  at  least,  automatic  derivative  indexing.    We  have  remarked  previously 
that  the  machine  use  of  complete  text  can  only  be  considered  to  be  "indexing"  in  a  very 
special  sense,  that  it  is  subject  either  to  the  non-availability  of  suitable  corpora  already 
in  machine-usable  form  or  to  high  costs  of  conversion  to  this  form,  and  that  too  little 
is yet known of linguistic analysis and searching-selection strategies effectively applicable
to natural language materials. Various examples of corroborating opinion, other than
those previously cited, are as follows:

"Machine searching is superb if it is known exactly how to describe the object of
search, and if one could know how to choose from among many possible searching
strategies. I doubt if any one is yet in this comfortable position with respect
to machine searching of text." 1/

"The most effective programs in automatic linguistic analysis have served only
to illustrate how really complex is the structure of the language, and how far
removed the present state of the art is from any system which might be useful
in practice." 2/

"The recognition of words involves only the matching of digital codes, but
the recognition of an idea is a severe intellectual problem, the solution to
which will probably never be exact. Nevertheless, this is the problem which
must be attacked if accuracy is ever to be attained, or even approached, in
using the text of information items as a basis for their recovery." 3/

Nevertheless, some of the work both in natural language text searching and in
"probabilistic indexing" (where weights representing judgments as to degree of relevance
of an indexing term to an item are used either in indexing or search) provides instructive
insights into some of the problems of automatic indexing.

In the period 1958-1960, work at Ramo-Wooldridge resulted in the release or
publication of provocative papers by Maron, Kuhns, and Ray on "probabilistic indexing"
(1959 [398], 1960 [397]) and by Swanson on natural language text searching by computer
(1960 [587, 582], 1963 [583]). Subsequent work along these lines has included further
developments at Thompson Ramo-Wooldridge, the law statutes work at the Health Law
Center at the University of Pittsburgh, and the experimental investigations of Eldridge
and Dennis in a project jointly sponsored by the American Bar Foundation, IBM, and the
Council on Library Resources.

6.4.1 Probabilistic Indexing - Maron, Kuhns, and Ray

The work in the area of "probabilistic indexing" involves, as in the case of Stiles'
statistical association factors, an assumption that there should be machine means
available for the automatic elaboration of search requests in order that relevant documents
not indexed by the precise terms of these requests may be retrieved. Given that measures of
"closenesses" and "distances" between similar documents can be obtained, probabilistic
weighting factors between index terms assigned to documents may be made explicit.
More generally, however, the notion of probabilistic indexing is based upon the
assignment of weights that provide a numerical evaluation of the probable relevance of index
terms to a particular document, and of the relative importance of the various terms
used in a search request. Maron and Kuhns (1960 [397]) thus consider the following
variables important in the formulation and following out of search strategies:

1.  Input - both the terms of the request and the weights assigned to them.

2.  A probabilistic matrix giving dissimilarity measures between documents,
significance measures for index terms, and closeness measures between
index terms.

3.  A priori probability distribution data.

4.  Output - a class of retrieved documents ranked in order of their "computed
relevance numbers" and an indication of the number of documents involved
in the class.

5.  Search parameter controls, such as the number of documents desired.

6.  Search prescription renegotiation involving amplification of the request by
adding terms "close" to the ones in the original request and the selection
of additional documents following distance criteria for the collection. 1/
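How several of these variables fit together can be suggested by a minimal sketch. It is a simplification, not the Maron and Kuhns formulation: the "computed relevance number" is assumed here to be the a priori probability of a document multiplied by the sum of request weight times index weight over the request terms, and all names and figures are hypothetical.

```python
def rank(request, index, priors, limit=None):
    """Return (document, score) pairs in descending order of assumed relevance number.

    request: {term: weight}            weighted terms of the search request
    index:   {doc: {term: weight}}     weighted index terms per document
    priors:  {doc: probability}        a priori probability data
    limit:   search parameter control on the number of documents desired
    """
    scored = []
    for doc, weights in index.items():
        match = sum(w * weights.get(term, 0.0) for term, w in request.items())
        score = priors.get(doc, 1.0) * match
        if score > 0:
            scored.append((doc, score))
    scored.sort(key=lambda pair: (-pair[1], pair[0]))  # ties broken by document name
    return scored[:limit]


index = {
    "D1": {"jet": 0.9, "engine": 0.6},
    "D2": {"jet": 0.3, "airline": 0.8},
    "D3": {"rocket": 0.7},
}
priors = {"D1": 0.2, "D2": 0.5, "D3": 0.3}
request = {"jet": 1.0, "engine": 0.5}

ranked = rank(request, index, priors)
top_one = rank(request, index, priors, limit=1)
```

Here `len(ranked)` plays the part of the indication of the number of documents involved in the retrieved class, and `limit` the part of the search parameter control in item 5.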

Experiments  have  been  reported  for  40  requests  run  against  110  articles  taken  from 
Science  News  Letter.    Without  search  renegotiation,  the  "answer"  document  was 
retrieved  in  only  27  of  the  40  tests.    Three  alternative  methods  of  request  elaboration 
were  then  tried.    First,  additional  terms  most  strongly  implied,  statistically,  by  the 
terms  in  the  request  were  used.    Secondly,  those  terms  were  added  which  most  strongly 
imply, again in a statistical sense, each of the given request terms. Thirdly,
coefficients of association between index terms were used. Results are reported as follows:

"(1) Using the method of request elaboration via forward conditional
probabilities between index tags, we retrieved the correct answer
document in 32 cases out of the 40.

(2) Elaborating the requests via the inverse conditional probability heuristic,
we retrieved the correct document in 33 of the 40 cases.

(3) Using the coefficient of association to obtain the elaborated request we
obtained success in 33 cases of the 40.

"Thus we see that the automatic elaboration of a request does, in fact, catch
relevant documents that were not retrieved by the original request." 1/
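The first two elaboration methods can be sketched as follows, under stated assumptions: conditional probabilities between index tags are estimated simply as fractions of documents in a small made-up collection, and a fixed threshold decides which terms are added. Neither the estimation procedure nor the threshold is given in the source, and the function names are hypothetical.

```python
def conditional(index, given, target):
    """Estimate P(target | given) as the fraction of documents tagged `given`
    that are also tagged `target`."""
    with_given = [terms for terms in index.values() if given in terms]
    if not with_given:
        return 0.0
    return sum(target in terms for terms in with_given) / len(with_given)


def elaborate(request, index, threshold, inverse=False):
    """Add terms strongly implied by the request terms (forward), or terms
    that strongly imply the request terms (inverse)."""
    vocabulary = set().union(*index.values())
    added = set()
    for term in request:
        for other in vocabulary - set(request):
            if inverse:
                p = conditional(index, other, term)  # P(request term | other)
            else:
                p = conditional(index, term, other)  # P(other | request term)
            if p >= threshold:
                added.add(other)
    return set(request) | added


index = {
    "A": {"jet", "engine"},
    "B": {"jet", "engine", "air force"},
    "C": {"jet", "airline"},
    "D": {"turbine", "engine"},
}
forward = elaborate(["jet"], index, threshold=0.6)
backward = elaborate(["jet"], index, threshold=0.6, inverse=True)
```

For comparison, the reported figures correspond to success rates of 27/40 = 67.5 percent without elaboration, 32/40 = 80 percent with forward elaboration, and 33/40 = 82.5 percent with either of the other two methods.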

6.4.2 Natural Language Text Searching - Swanson

The work in automatic indexing and related research directed by Swanson at
Ramo-Wooldridge Corporation has included "indexing at the time of search" in natural language
text searching (1960 [582, 587], 1963 [583]), the previously mentioned studies of
machine-like indexing by people (Montgomery and Swanson, 1962 [421]), and automatic
assignment indexing using pre-selected lists of clue words (Swanson, 1963 [580]). The
last of these three major areas of investigation is the one of the greatest interest in this
present study, but the earlier experiments in machine searching of natural language texts
warrant some discussion. In his reports on this text searching project, Swanson has
specifically claimed that the methods for transforming search questions can serve as
the basis for an automatic indexing method. Thus:

"... A technique for automatic indexing can be derived immediately from a text
searching technique ... it is necessary only to so organize the machine procedures
that those operations of text reduction or reorganization common to all searches
are performed only once and prior to searching in order to create directly an
automatic indexing procedure." 2/

Swanson has also claimed that if automatic searching of full text is not feasible,
then automatic indexing is not feasible, the one being prerequisite to the other. For
example:

"Clearly, if a computer technique for search and retrieval from the full text
of a collection of documents cannot be developed, then it is unthinkable that
matters could be improved by using the machine to operate on just part of
the information (a 'condensed representation') -- that is, on an automatically
produced index. This line of argument demonstrates persuasively that the
development of techniques for automatic full-text search and retrieval is a
prerequisite to automatic indexing. It is equally clear that a technique for
automatic indexing can be derived immediately from a text-searching
technique, and thus that the two processes involve conceptually equivalent
problems." 3/

In  the  actual  text  searching  experiments,  a  model  "library"  consisting  of  100  short 
articles  in  the  field  of  nuclear  physics  was  set  up  in  machine-usable  form.    These  articles 
were  also  studied  by  subject  specialists  who  rated  the  relevance  of  each  paper  to  each  of 
50  questions,  and  assigned  weighting  factors  representing  the  degree  of  judged  relevance. 
A second group of people, who knew only that the papers were in the field of nuclear
physics, then transformed the 50 questions into search prescriptions using three
different methods. The first method for the development of the search instructions was
to choose appropriate index entries from a subject heading list tailored to the contents of
the sample library. Search was then made manually against a card catalog which
recorded the results of manual indexing of the same 100 articles to the entries of this
list.

The second method of search prescription tested involved the specification of
combinations of words and phrases likely to be found in any paper which would in fact be
relevant to the search question. The third method involved modification of the second by
the use of a thesaurus-type glossary which suggested various alternative terms. Both
the latter two types of search instructions were fed to a computer program which carried
out searches against the natural language text consisting of 250,000 words from the
original articles.

The results were then evaluated in terms of ratings of relevance made by the
physicists who had analyzed the papers. Retrieval effectiveness was not high: "... in no
case did the average amount of relevant material ... retrieved (taken over 50 questions)
exceed 42 per cent of that which was judged ... to be present in the library." 1/ However,
the results were indicative of the superiority of the machine methods to the manual
catalog search. 2/ For this library in particular, in the case of "source documents" (the
articles from which the search questions were taken), only 38 percent of the relevant
papers were located by the manual search, whereas 68 percent of the relevant items
were retrieved by machine search of the text for specified words and phrases in various
"and" and "or" combinations. Machine search based on search instructions that had been
developed with the assistance of the thesaurus-glossary yielded 86 percent of the relevant
source item documents.
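The second and third search methods can be sketched in miniature. The representation below is an assumption made for illustration: a search prescription is taken to be an "and" of groups, each group an "or" of words or phrases, and the glossary simply supplies alternatives for individual terms. The three one-sentence "articles" and all names are invented.

```python
import re


def normalize(text):
    """Lower-case the text and pad it with spaces so that only whole words match."""
    return " " + " ".join(re.findall(r"[a-z0-9]+", text.lower())) + " "


def matches(text, prescription):
    """True if every "or" group has at least one word or phrase present in the text."""
    haystack = normalize(text)
    return all(
        any(normalize(phrase) in haystack for phrase in alternatives)
        for alternatives in prescription
    )


def expand(prescription, glossary):
    """Add the glossary's alternative terms to each "or" group (third method)."""
    return [
        alternatives + [alt for term in alternatives for alt in glossary.get(term, [])]
        for alternatives in prescription
    ]


def search(library, prescription):
    return sorted(doc for doc, text in library.items() if matches(text, prescription))


library = {
    "P1": "Elastic scattering of neutrons by deuterium.",
    "P2": "Gamma-ray spectra from neutron capture in iron.",
    "P3": "Scattering of protons at high energy.",
}
plain = [["neutron"], ["scattering"]]
glossary = {"neutron": ["neutrons"]}

plain_hits = search(library, plain)
glossary_hits = search(library, expand(plain, glossary))
```

In this toy collection the unaided prescription misses the one relevant article because of a plural form, while the glossary-assisted prescription finds it, which is the kind of gain the reported 68 and 86 percent figures reflect on the real collection.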

6.4.3 Full Text Searching - Legal Literature

"The retriever of documents may be satisfied with a sample of descriptors that
represent the contents; the fact retriever or the question answerer must often have
access to every word in the text". 3/ The objective of fact retrieval is a major goal in
the experimentation that is being carried forward in the field of natural language text
searching of legal material, especially the texts of statutes of the State and Federal
Governments. The most extensive program to date is that of Horty and his colleagues
at the University of Pittsburgh Health Law Center (1960 [277], 1961 [276, 309], 1962
[196, 278], 1963 [24, 280]).

1/ Swanson, 1960 [582], p. 25.

2/ Ibid, p. 1: "On the whole, retrieval effectiveness was rather poor, yet machine
search of the text of the model library was significantly better than was human
searching of the subject heading index."

3/ Simmons and McConlogue, 1962 [555], p. 3.

Wilson at the Southwestern Legal Foundation is experimenting with a modified
version of the Horty-Pittsburgh System for legal cases dealing with arbitration in five of
the southwestern states. 1/ A joint American Bar Foundation--IBM research program has
been established to explore both text searching without prior indexing and automatic
indexing techniques (Eldridge and Dennis, 1962 [183], 1963 [182]).

In the Horty-Pittsburgh System, approximately 6,000,000 words of text have been
converted via Flexowriter to magnetic tape. An exclusion dictionary of 100 words is used
to eliminate the most common words and a word-concordance is prepared, resulting in
word-occurrence location indicia by position in sentence, paragraph and section of the
statute. In searching, the user has available to him the alphabetized list of approximately
17,000 different words and it is up to him to think of the words and synonyms most likely
to occur in statute sections likely to be the ones he seeks. Several search logics are
available. One provides that at least one of a group of alternate words must appear;
another requires that at least one from two or more groups must appear in the same
sentence. Intra-sentence distance criteria are also utilized: "If the phrase 'born out of
wedlock' is sought, the operator ... requires that the word 'wedlock' appear in the same
sentence, no more than three words after 'born'." 2/
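The concordance and the distance criterion just described can be illustrated with a
minimal sketch. Everything named here is an assumption made for illustration only: the
short exclusion list, the simple tokenizer, the sample statute texts, and the function names
`build_concordance`, `any_of`, and `within` are hypothetical and are not taken from the
Horty-Pittsburgh programs.

```python
import re
from collections import defaultdict

# Hypothetical exclusion list; the system described used about 100 common words.
EXCLUDED = {"a", "an", "and", "in", "is", "of", "or", "out", "the", "to"}


def build_concordance(sections):
    """Map each retained word to its (section, sentence, position) occurrences."""
    concordance = defaultdict(list)
    for sec_id, text in sections.items():
        sentences = [s for s in re.split(r"[.;]\s*", text) if s]
        for sent_no, sentence in enumerate(sentences):
            # Positions count every word, so distances refer to the original text.
            for pos, word in enumerate(re.findall(r"[a-z]+", sentence.lower())):
                if word not in EXCLUDED:
                    concordance[word].append((sec_id, sent_no, pos))
    return concordance


def any_of(concordance, words):
    """Sections in which at least one of a group of alternate words appears."""
    return {sec for w in words for sec, _, _ in concordance.get(w, [])}


def within(concordance, first, second, max_gap):
    """Sections where `second` follows `first` in the same sentence,
    no more than `max_gap` words later."""
    hits = set()
    for sec, sent, pos in concordance.get(first, []):
        for sec2, sent2, pos2 in concordance.get(second, []):
            if (sec, sent) == (sec2, sent2) and 0 < pos2 - pos <= max_gap:
                hits.add(sec)
    return hits


# Invented sample statute sections.
statutes = {
    "S1": "A child born out of wedlock may inherit from the mother.",
    "S2": "Support of illegitimate children is a duty of the father.",
    "S3": "No person born abroad is barred. Wedlock is not required.",
}
index = build_concordance(statutes)
print(within(index, "born", "wedlock", 3))
print(any_of(index, ["wedlock", "illegitimate"]))
```

Storing the sentence number and word position with every occurrence is what allows the
proximity test to be answered from the concordance alone, without rereading the text.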

Obviously, for the same question the searcher would also have to specify synonymous
words and phrases -- "illegitimate children", "illegitimate births", "unwed mothers",
"unmarried mothers", "illegitimacy", "bastardy", and so on. The reported success of
the system is apparently due in large part to the ingenuity of the searchers in specifying
the expressions and synonyms most likely to be used. Hughes comments as follows:

"It should be noted that this system will be most efficient only when the users
are thoroughly familiar with the linguistic style of the source material and
search is made on words known to occur in the appropriate statutes". 3/

6.5  Other Examples of Related Research in Linguistic Data Processing

Since, as Garvin has emphasized, "All areas of linguistic information processing
are concerned with the treatment of the content, rather than merely the form, of
documents composed in a natural language," much of the research in linguistic data
processing is potentially applicable to both the development and the improvement of
automatic indexing techniques. Thus developments in automatic content analysis, in
psycholinguistics, in question-answering systems, may eventually find application to
mechanized indexing systems.


1/ Eldridge and Dennis, 1963 [182], p. 90; Wilson, 1962 [645].
2/ Horty, 1962 [278], pp. 59-60.
3/ Hughes, 1962 [284], p. IV-6 to IV-8.

In terms of our present concern, however, we shall select only a few examples.
"By automatic content analysis is meant the use of computer programs to detect or select
content themes in a sentence-by-sentence scanning of text or verbal protocols". 1/ The
interest of psychologists in machine techniques to assist in the analysis of linguistically-
given materials, as in propaganda analysis, probably precedes, at least in sophistication
if not by date, that of documentalists or of machine specialists interested in library and
information problems. 2/

The "General Inquirer" program developed by Stone et al 3/ is an example of
question-answering techniques based upon selective extractions from natural language
text. It involves the use of a master vocabulary consisting of words previously selected
by an investigator as being likely to be content-indicative in a body of material to be
processed, together with his pre-established indications of the categories he expects
their occurrence should predict. It is to be noted that this is a custom-tailored set of
categories and of clue-word lists associated with each, manually pre-established. Text
is now processed in such a way that each word is looked up and, if it appears in the master
vocabulary, it is tagged with identifiers of the categories for which it is presumably
predictive. A subsequent "Tag Tally" routine then counts the tag frequencies to
determine for which categories the input material has high or low scores, and these in turn
can be compared with expected norms.
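The look-up, tagging, and tally steps can be sketched as follows. This is an illustrative
reduction only, not the General Inquirer itself: the master vocabulary, the category names,
the sample text, and the norm values are all invented, and the functions `tag_text`,
`tag_tally`, and `compare_with_norms` are hypothetical names.

```python
import re
from collections import Counter

# Hypothetical master vocabulary: clue word -> categories it is taken to predict.
MASTER_VOCABULARY = {
    "afraid": ["DISTRESS"],
    "alone": ["DISTRESS", "ISOLATION"],
    "family": ["KINSHIP"],
    "mother": ["KINSHIP"],
}


def tokenize(text):
    """Split text into lower-case words."""
    return re.findall(r"[a-z]+", text.lower())


def tag_text(words, vocabulary):
    """Look up each word; keep (word, tags) for words found in the vocabulary."""
    return [(w, vocabulary[w]) for w in words if w in vocabulary]


def tag_tally(tagged):
    """Count how often each category tag was assigned."""
    return Counter(tag for _, tags in tagged for tag in tags)


def compare_with_norms(tally, total_words, norms):
    """Observed tag rate minus an expected rate, per category."""
    return {cat: tally.get(cat, 0) / total_words - expected
            for cat, expected in norms.items()}


words = tokenize("I am alone and afraid. My mother and my family are far away.")
tally = tag_tally(tag_text(words, MASTER_VOCABULARY))
# Invented norms: expected tag occurrences per word of running text.
scores = compare_with_norms(tally, len(words), {"DISTRESS": 0.05, "KINSHIP": 0.20})
print(tally)
print(scores)
```

A positive score marks a category that occurs more often than the assumed norm, a negative
score one that occurs less often.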

This type of program has been applied to such varied materials as suicide notes,
folk tales from different cultures, reports of field workers, recordings of group
discussions as in supervisory-leadership training sessions, and protocols for various
psychological tests. 4/ Interesting variations developed by Jaffe and others involve
the use of non-verbal as well as verbal clues as content-indicators, specifically,
time-sequence patterns recorded along with the words spoken in client-therapist sessions. 5/ At
the meeting of the Association for Computational Linguistics and Machine Translation
held in Denver, August, 1963, Jaffe reported findings indicative of positive correlation
between the structure of temporal and lexical patterns in dialogue and suggested
applications to automatic abstracting or indexing by the use of the time-sequence patterns as
clues to high information-value areas.


1/ Ford, Jr., 1963 [498], p. 3.
2/ See, for example, Jaffe 1952 [297], Hart and Bach, 1959 [256], Pool, 1959 [475],
the latter covering the proceedings of a conference held in 1955.
3/ Stone and Hunt, 1963 [576]; Stone et al, 1962 [575].
4/ See Ford, 1963 [498], p. 8.
5/ See for example, Cassotta, et al, 1964 [104]; Jaffe, [294] to [...].

Hughes provides, as of September, 1962 ([284]), a critical review of several
experimental and proposed question-answering systems using natural language statements
and natural language queries, including "BASEBALL", 1/ "SAD SAM" 2/ and the "Proto
Synthex" investigations of System Development Corporation. 3/ Later developments on
the Synthex (synthesis of complex verbal material) project at SDC have included a
variation on a natural language text searching program where ordinary text input is run
against an exclusion list and a table is set up to tally the substantive words remaining.
Words with the same roots or previously having been identified as synonymous are
cross-referenced. A complete index results, with document location identifier tags for the
word occurrences down to the single sentence level. This index can be used subsequently
to locate regions of text (volume, chapter, paragraph, and sentence) where answers
responsive to input questions are likely to be found.
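An index of this general kind can be sketched briefly. The sketch rests on assumptions of
our own: the exclusion list and cross-reference table are tiny invented samples, locations
are written as (volume, chapter, paragraph, sentence) tuples, and the ranking rule in
`candidate_regions` (a count of distinct question terms found at a location) is a
hypothetical stand-in, not the Synthex procedure.

```python
import re
from collections import defaultdict

# Hypothetical exclusion list of non-substantive words.
EXCLUSION_LIST = {"the", "a", "of", "in", "is", "are", "and", "for", "was", "were"}
# Hypothetical cross-reference table: variant form -> common index entry.
CROSS_REFERENCES = {"whaling": "whale", "whales": "whale", "britain": "england"}


def canonical(word):
    """Replace a word by the index entry it is cross-referenced to, if any."""
    return CROSS_REFERENCES.get(word, word)


def build_index(corpus):
    """corpus maps (volume, chapter, paragraph, sentence) -> sentence text."""
    index = defaultdict(list)
    for location, sentence in corpus.items():
        for word in re.findall(r"[a-z]+", sentence.lower()):
            if word not in EXCLUSION_LIST:
                index[canonical(word)].append(location)
    return index


def candidate_regions(index, question):
    """Order locations by the number of distinct question terms they contain."""
    terms = {canonical(w) for w in re.findall(r"[a-z]+", question.lower())
             if w not in EXCLUSION_LIST}
    scores = defaultdict(int)
    for term in terms:
        for location in set(index.get(term, [])):
            scores[location] += 1
    return sorted(scores, key=lambda loc: (-scores[loc], loc))


# Invented sample text, one entry per sentence.
corpus = {
    (1, 1, 1, 1): "Whaling was a major industry in England.",
    (1, 1, 1, 2): "The ships were built of oak.",
    (1, 2, 1, 1): "Whales are hunted for oil.",
}
index = build_index(corpus)
regions = candidate_regions(index, "Where in Britain were whales hunted?")
print(regions)
```

Because the cross-references are applied both when indexing and when reading the question,
a query phrased with one variant can reach sentences that use another.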

It  is  proposed  that  the  Synthex  system  eventually  should  incorporate  analyses  of 
syntactic  and  semantic  relationships  in  the  linguistic  expressions  of  both  queries  and 
text.    Of  future  interest  in  the  extension  of  such  considerations  to  automatic  indexing  and 
abstracting  are  the  following  comments: 

"The   results  of  several  early  experiments  within  the  project,  coupled  with  the 
findings  of  other  language  researchers,  led  to  the  following  conclusions  about 
meaning  and  grammatical  structure  in  English  text: 

1.  The  degree  of  synonymity  in  meaning  between  any  two  English 
words  can  be  measured  quantitatively  with  a  synonym  dictionary 
and  relatively  simple  scoring  procedures. 

2.  The  difference  in  meaning  between  two  sentences  of  identical 
syntactic  structure  can  be  expressed  quantitatively  as  a  function 
of synonymity of their words ..." 4/

It is also of interest to note that although the "indexer" program of the Synthex
system provides cross-referencing between, for example, "whales" and "whaling" or
"England" and "Great Britain", the investigators admit that: "naturally it falls short of
such complicated cross-referencing as 'mouse-animal' 'Jones-person' and other
concept recognitions." 5/

1/ See also Green et al, 1961 [238].
2/ See also Lindsay, 1960 [363].
3/ See also Klein and Simmons, 1961 [325]; Simmons et al [552] to [555].
4/ System Development Corporation, 1962 [590].
5/ Simmons and McConlogue, 1962 [555], p. 70.

However, concept recognitions based upon both a priori and a posteriori associations
are at least foreshadowed in a small-scale model of attribute-words and proper names,
together with prespecified relationships between them; 1/ in Olney's recent work at SDC
exploring the possibilities for use of cognitive concepts as bases for establishing
association between documents, 2/ and by Kochen's work on machine inference and
concept processing. 3/

A final example of potentially related research in the area of content analysis is
therefore the work of Kochen, Abraham, Wong and others at IBM's Thomas J. Watson
Laboratories (1962 [329]). While concerned principally with adaptive organization and
processing of stored factual statements and the possibilities for machine formulation
of "hypotheses" about these and additional facts, some consideration has been given to
sampling procedures applicable to determination of similarity which might be used for
document clustering and to the possibilities for dynamic clustering for retrieval based
upon a specific individual query. 4/ In the proposed AMNIP (Adaptive Man-machine
Non-Arithmetical Information Processing) system, there is no attempt at either automatic
indexing or automatic abstracting. 5/ Instead, formal statements are made about named
"things" and their attributes. The sharing of common attributes then serves as a basis
for relating items which are similar and for grouping them together in the system
memory. It is assumed that the organization of the stored statements changes
dynamically with new data inputs and user feedback in question-answering routines.

Where the named items are names of documents or of index terms, a number of
documentation applications can be considered. Where the items are document names
and the formal predicate is "cites", the system provides a procedure for production and
use of citation indexes. 6/ Where the items are index terms or subject headings and
the predicates are "is used synonymously with" or "is subsumed under", machine
construction of a growing thesaurus based on use is suggested. 7/

1/ See Stevens, 1960 [568]; see also Herner, 1962 [266], p. 5.
2/ See Borko, 1962 [75], p. 5: "Instead of defining meaning in terms of synonyms ...
it is defined in terms of the entities referred to by the word in context. A chair
is thus described as belonging to a class defined by a given list of properties ...
Analysis yields an interpretation of the sentence as an assertion that certain
relationships hold between the specified referent classes. The cognitive content
of the sentence is a function of this assertion plus the information about these
referent classes which has previously been stored in memory."
3/ Kochen et al, 1962 [329].
4/ Ibid, Appendix by C. T. Abraham, pp. 20-65.
5/ Kochen et al, 1962 [328], p. 45.
6/ Ibid, p. 37.
7/ Ibid, p. 37.

The common attribute matching program, applied to logical similarities of texts related
as by having various assigned descriptors or citations in common, might provide a basis
for generating document surrogates by representing each text in a related group of texts
with the words or sentences these texts have in common. 1/

In  the  case  of  man-machine  interaction  during  search,  it  is  suggested  that  the  user 
should  indicate  the  names  of  selected  documentary  items  which  are  of  particular  interest, 
then: 

"The  machine  forms  an  'hypothesis'  about  the  subset  of  articles  likely  to  be  of 
interest.    It  does  this  by  examining  all  recorded  statements  common  to  the  ones 
selected  but  not  to  the  rejected  ones.     The  weight  of  different  attributes  and 
degree of interest is taken into account. The machine may display this hypothesis
or another random sample of titles consistent with it, or both." 2/
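The hypothesis-forming step in this passage can be illustrated with plain set operations. The
sketch below is a simplification under stated assumptions: the stored statements are
invented (predicate, value) pairs, `form_hypothesis` and `consistent_items` are
hypothetical names, and the weighting of attributes and degrees of interest mentioned in
the quotation is omitted entirely.

```python
# Hypothetical store of formal statements: item name -> set of (predicate, value).
STATEMENTS = {
    "doc1": {("cites", "docA"), ("term", "indexing"), ("term", "statistics")},
    "doc2": {("cites", "docA"), ("term", "indexing"), ("term", "law")},
    "doc3": {("term", "indexing"), ("term", "chemistry")},
    "doc4": {("cites", "docA"), ("term", "indexing"), ("term", "translation")},
    "doc5": {("cites", "docB"), ("term", "law")},
}


def form_hypothesis(statements, selected, rejected):
    """Statements common to all selected items but made of no rejected item.
    At least one selected item is required."""
    common = set.intersection(*(statements[d] for d in selected))
    seen_in_rejected = set().union(*(statements[d] for d in rejected))
    return common - seen_in_rejected


def consistent_items(statements, hypothesis, exclude=()):
    """Items whose statements include every statement in the hypothesis."""
    return sorted(d for d, attrs in statements.items()
                  if hypothesis <= attrs and d not in exclude)


hypothesis = form_hypothesis(STATEMENTS, selected=["doc1", "doc2"], rejected=["doc3"])
suggestions = consistent_items(STATEMENTS, hypothesis, exclude=["doc1", "doc2"])
print(hypothesis)
print(suggestions)
```

Note that an empty hypothesis is consistent with every stored item, so a practical version
would need some rule for that case; none is suggested in the source.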

6.6  Machine Assistance in Translations of Subject Content Indications to Special Search
and Retrieval Language

There are, also, in the areas of directly and indirectly related research, certain
programs of research, development, and experimentation which include investigations
of possibilities for using machines to assist in the "translation" of textual languages into
special intermediate or "documentary" languages. Doyle's use of the inclusion list
principle to extract specified content-indicative words and to encode them in his "bigram"
index was an early but relatively trivial example. 3/ The work of Williams and her
associates, at Itek and elsewhere, 4/ has involved the objectives of determining which of
the subject-revealing implications of titles, abstracts and, if necessary, full text, are
susceptible to machine detection and manipulation such that the implied as well as the
explicit assertions made in a document may be incorporated in a formalized language for
retrieval.

1/ Kochen et al, 1962 [329], p. 2.
2/ Kochen et al, 1962 [328], p. 7.
3/ Doyle, 1959 [168]. See also p. 123 of this report.
4/ See, for example, T. M. Williams, R. F. Barnes, Jr., J. W. Kuipers,
various references.

While Williams, Barnes, Gardin and Levy, and others, have so far approached
such tasks primarily from the standpoint of human analytic judgments, Coyaud (1963
[143]) has discussed at least preliminary work looking toward the automation of the
analysis of natural language texts for purposes of encoding and organization of the terms
and relationships to be used in the "documentation language" known as "SYNTOL"
(Syntagmatic Organization of Language). This work has used a corpus based on
bibliographic abstracts from the Bulletin signaletique of the Centre National de la Recherche
Scientifique, Psychophysiology Section, for the period 1958-1960. Notwithstanding such
difficulties as determining rules for proper subdivisions of text, reduction of synonyms,
resolution of lexical and syntactic ambiguities, and the fact that some words are always,
but some never, used in SYNTOL itself, he reports that both substantives and textual
expressions indicative of certain specific SYNTOL relations can be unambiguously
identified. Contextual clues are used: for example, if the word "homme" occurs it is
translated as "sexe masculin" if "femme" also occurs, as "etre humain" if "animal" is also
mentioned, and as "sujet experimental" otherwise.
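The contextual rule for "homme" can be written out as a short sketch. Only the three
translations and their conditions come from the description above. The function name
`translate_homme` is hypothetical, and the precedence given to "femme" when both clue
words occur is an assumption, since the source does not say which condition governs in
that case.

```python
def translate_homme(words):
    """Choose a translation for 'homme' from the other words in the same text.

    Returns None when 'homme' does not occur. Testing 'femme' before 'animal'
    is an assumed precedence, not one stated in the source.
    """
    present = {w.lower() for w in words}
    if "homme" not in present:
        return None
    if "femme" in present:
        return "sexe masculin"
    if "animal" in present:
        return "etre humain"
    return "sujet experimental"


print(translate_homme(["homme", "femme", "comparaison"]))
print(translate_homme(["animal", "homme"]))
print(translate_homme(["homme", "fatigue"]))
```

Each clue word here is a single token; rules that depend on phrases or on word order would
need a richer representation of context.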

Melton and her associates at the Center for Documentation and Communication
Research, Western Reserve, have also been investigating machine processing of input
text with a view to the automatic selection and manipulation of clue words and
relationships between them for information retrieval purposes. Their material consists of
abstracts from the metallurgical section of Chemical Abstracts. From sample abstracts,
a lexicon is developed which involves classification of words into those that are
significant from a metallurgical point of view; those that name materials, compounds,
environments; those denoting processes; those denoting characteristics of materials;
prepositions; those which will not operate in the analysis of the text, and the like.

On the basis of analysis of a number of sentences from the sample text, rules for
combination  and  selection  of  specified  words  in  specified  relationships  can  be  set  up. 
These  rules  are  designed  to  identify  sentence  types  which: 

(1)  Describe  performance  of  a  process  on  a  material. 

(2)  Discuss  a  material  in  terms  of  properties,  components,  form,  or 
environment. 

(3)  Describe  a  process  without  reference  to  specific  materials. 

(4)  Discuss  metallurgical  properties  without  reference  to  specific 
materials. 

(5)  Discuss  two  or  more  materials,  properties  or  processes. 

(6)  Describe  a  causal  relationship  between  two  properties. 

(7)  Give  a  comparison  of  materials. 

(8)  Contain  no  words  of  interest  in  the  system. 

Computer programs to explore the possibilities for automatic analyses of the
kind developed manually for the sample abstracts will be written with the objective of
finding an effective compromise between mere word identification and total linguistic
analysis. Melton says:

"If one considers this method of analysis from the point of view of the linguist,
he can immediately describe many grammatical constructions, which will
prevent the meaningful reduction of these sentences. It is not known at this
time how often such sentences will appear in the corpus of this investigation.
Nor is it known how adversely such failure would affect the retrieval of the
information in these sentences. The answers to these questions will be
available only after a large sample has been analyzed and put to an extensive
retrieval test. At its most successful the project will achieve an automatic
processing of metallurgical text which will permit retrieval of the type of
information which can be stated in its own terms with a tolerable amount of
inappropriate selections. Should this goal be unattainable, the project will
have generated a file of abstracts automatically searchable on the word level
or somewhat beyond. For the benefit of other research, it will also have
produced tapes of the true text of a large sample of natural-language
abstracts and a lexicon containing all the words of a corpus of current
scientific literature." 1/

6.7  Example of a Proposed Indexing System Utilizing Related Research Techniques

In addition to the automatic assignment indexing and automatic classification
techniques for which experimental results have been reported, several other techniques
and programs have been proposed. One is the joint American Bar Foundation-IBM
research program (Eldridge and Dennis, 1963 [182]), for which discussion has been
deferred because of its proposed use of several of the research techniques covered
previously in this section. The experimental corpus will consist of the full text of
approximately 5,000 legal case reports taken chronologically from the Northeastern
Reporter. Approximately half of this material will be processed to obtain word frequency
counts. The frequencies will then be used to prepare for each different word an estimate
of the skewness of its distribution in the collection. The investigators will then personally
inspect the word list as ordered by skewness to divide it into "non-informing" (Type I
words, or an exclusion list) and "informing" (Type II words, or an inclusion list) at some
appropriate cutting point. Then, for each document, a list will be prepared of its
"informing" (Type II) words, maintaining order within the document. For each pair of
such words, statistical association factors will be computed. Eldridge and Dennis
describe other aspects of their proposed technique, in part, as follows:

"For each document in the body of 2,500 cases, a list will be prepared of its
Type II words, maintaining their original order within the document ... For each
Type II word an 'association factor' will be calculated for every other Type II word
with which it appears in any one document by compiling the probability that Word A
would appear this close to Word B this number of tries over the entire file, if
the Type II words were distributed at random. (This amounts to borrowing Stiles'
idea of the association factor, but implementing it with a numerical method which
takes into account nearness of the words within the document as well as the fact
that they both occur in the same document.) Since the factors are probabilities,
they will be numbers between zero and one ... These numbers will be used to
estimate the distances between words in index-word space.

"The next step is to construct from the information about distances between pairs of
words an index-word space in which every word is at the correct (or approximately
correct) distance from every other word in the system with which it exhibits
association. The result of this operation can be visualized schematically as a sort
of grid in which every word can be placed in its appropriate position by assigning
it a set of coordinates."


1/ Melton, et al 1963 [414], pp. 14-15.

"Indexing of the remaining cases in the experiment will be performed by machine
from full text, using the Type I list of discard words and the Type II list to
prepare an analysis of the frequencies related to index-word space. Instead of
selecting specific words as indexing terms, concepts will be selected (statistically)
as volumes in index-word space. A rough physical analogy to this process would
be to toss pennies at the previously mentioned grid so that, for every Type II
word in the source document, a penny lands at its proper slot on the grid. Where
the pennies heap up in a pile, you have a concept."

"Searching will be carried out essentially by indexing a question presented
narratively, determining the concept volumes that represent the question, and
searching those volumes in document space for the relevant document numbers.
Since the 'edges' of the concept volumes are determined statistically, output can
be listed in order of probable relevance; as an option the question could be
accompanied by a request that 'at least 100 references be supplied', in which case
the concept boundaries would be adjusted to provide that number." 1/

It will thus be noted that the proposed indexing and search program begins on a
derivative basis to establish for one-half the experimental material the significant words,
next combines word frequency with significant word distance data to derive probabilistic
association factors between words, then develops clusters, and finally indexes the items
in terms of the clusters rather than words so as to provide assignment rather than
extraction of index terms.
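The penny-tossing analogy in the quoted proposal lends itself to a small sketch. It is a
deliberately crude reduction resting on assumptions: the two-dimensional word coordinates
are invented rather than derived from association factors, fixed square grid cells stand
in for the statistically determined concept volumes, and the names `toss_pennies` and
`concepts` are hypothetical.

```python
from collections import Counter

# Hypothetical Type I (discard) words.
TYPE_I = {"the", "of", "for", "a", "and", "was", "by"}
# Hypothetical coordinates of Type II words in a 2-D "index-word space".
WORD_COORDINATES = {
    "contract": (1.0, 1.2),
    "breach": (1.3, 1.1),
    "damages": (1.4, 1.4),
    "negligence": (4.1, 3.9),
    "injury": (4.3, 4.2),
    "railroad": (7.5, 0.5),
}


def toss_pennies(words, cell_size=1.0):
    """Drop one 'penny' per Type II word occurrence into the grid cell
    that contains the word's coordinates; return the heap heights."""
    heaps = Counter()
    for word in words:
        if word in TYPE_I or word not in WORD_COORDINATES:
            continue
        x, y = WORD_COORDINATES[word]
        heaps[(int(x // cell_size), int(y // cell_size))] += 1
    return heaps


def concepts(heaps, min_height=2):
    """Cells where the pennies heap up are taken as the document's concepts."""
    return sorted(cell for cell, height in heaps.items() if height >= min_height)


document = "damages for breach of the contract and injury by the railroad".split()
heaps = toss_pennies(document)
print(heaps)
print(concepts(heaps))
```

Lowering `min_height` or enlarging `cell_size` plays roughly the role of the adjustable
concept boundaries mentioned in the quotation, at the cost of admitting weaker concepts.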

7.  PROBLEMS OF EVALUATION

We have noted, in the introduction to this report, that several fundamental and
highly controversial questions can be raised with respect to the feasibility and evaluation
of any automatic indexing scheme and with respect to the evaluation of any indexing
systems whatsoever. Yet if automatic indexing procedures are to be based upon previous
human indexing or if their results are to be compared with human results, then the
questions of the quality, the reliability and the consistency of human indexing are crucial
ones indeed. Thus, Solomonoff warns:

"The finding of exact languages for retrieval is also made less likely, in view
of the fact that the categorizations of documents that are presented to the machine
as a training sequence will not be performed altogether consistently by the human
cataloger." 2/

Montgomery  and  Swanson  ask  whether  human  indexers  are  in  fact  self-consistent  and 
consistent  with  each  other,  and  they  suggest: 


1/ Eldridge and Dennis, 1963 [182], pp. 97-99.
2/ Solomonoff, 1959 [562], pp. 9-10.

"If the answer turns out to be 'no', we might reasonably conclude that the only
reliable and effective kind of human indexing is that which is already machine-like
in nature." 1/

With  a  few  noteworthy  exceptions,  there  has  been  very  little  serious  investigation  of  these 
problems  and  there  is  very  little  comparative  data. 

O'Connor has been making a series of studies, with considerable emphasis upon how
one might measure the products of machine indexing and how one might derive machine
rules for automatic indexing from systematic review of documents indexed by people.
Cleverdon and his associates at the ASLIB Cranfield project have extensively tested
several different indexing procedures. Painter, MacMillan and Welt, Slamecka and
Zunde, and others report findings on intra-indexer and inter-indexer consistency --
unfortunately, on the basis of quite small samples. Various alternate approaches to the
evaluation of automatic indexing results have been considered by Borko, Doyle, Swanson,
Savage, Giuliano, and others. In addition, some data bearing on these questions have
been reported in connection with analyses of selective dissemination (SDI) systems.
Some data from other sources, such as studies of user preferences with respect to
various reference and search tools, are also pertinent.

The most generally accepted criterion for appraising the effectiveness of indexing
is that of retrieval effectiveness. But, in general, this is merely the substitution of
one intangible for another, entailing a string of as yet unanswerable or at least
unresolved questions. 2/ Retrieval of what, for whom, and when? How can effectiveness be
measured except by the elusive question of relevance judgments? How can human
judgments of relevance and value be measured and quantified?

We  shall  try  to  distinguish  here,  insofar  as  possible,  between  the  core  problems 
that  make  the  evaluation  of  indexing  as  such  an  extremely  difficult  task,  the  available 
data  on  human  indexer  reliability,  and  the  possible  advantages  and  disadvantages  of 
automatic  indexing  techniques. 


1/ Montgomery and Swanson, 1962 [421], p. 366.
2/ Compare Swanson, 1960 [582], pp. 2-3: "The performance of retrieval
experiments when relevance judgments per se cannot be consistently assessed by human
judgment would seem to represent overly vigorous pursuit of a solution before
identifying the problem." Similarly, see Black, 1963 [64], p. 14: "Finally,
when one is faced with an existing collection of indexed materials, how does one
assess the effectiveness of any retrieval system? Suppose that one receives 20
documents as a result of a query to the system. Suppose further that all 20
documents are quite pertinent to the topic of interest. Is there any way to assess the
amount of pertinent information still unretrieved from the file? Or is there any
way of learning whether the retrieved information is more pertinent than the
unretrieved information? The answer is 'No!' -- the use of any retrieval system
is, then, an act of faith in the quality of indexing."

7.1  Core Problems


First and foremost of the core problems implicit in the question of evaluation of any
indexing scheme, whether applied by man, machine, or man-machine combinations, are
those of interpersonal communication itself, which in turn relate to fundamental problems
of epistemology. These are, first, the problems of language as a means of
communicating perceptions, apperceptions of relationships between present observations and
prior experience, and value judgments based thereon, and, secondly, even more
fundamentally, the question of the veridicality of language representations of real
transactions and events. Serious investigators in the field, including many who have themselves
contributed to automatic indexing techniques, have made such typical acknowledgments of the
difficulties as the following:

"The  imprecision  connected  with  discussion  of  retrieval  effectiveness  and  of 
relevance  is  not  due  to  lack  of  understanding  of  the  relatively  straightforward 
retrieval  processes,  but  is  due  to  our  lack  of  basic  understanding  about  language, 
meaning and human communication itself." 1/

"Fundamentally,  the  study  of  inquiry  procedures  is  a  problem  in  the  general 
psychology of cognitive functioning. Relevant problems concern the way
problems are recognized and formulated into questions, the way a search plan
is developed to find answers to questions, and finally, the way it is decided
whether or not a possible answer matches the specifications of a question." 2/

A second core problem is the heterogeneous and somewhat arbitrary development of
natural languages themselves. It is much the same fundamental problem whether men or
machines are to read text and determine the "meaning" (at least, in the sense of
communication intent) of messages expressed in a natural language. However, the problems
are aggravated if men themselves must know enough about language and its conveyances
of message content to specify precisely to a machine what it is to look for and to use.

Salton  enumerates  some  of  these  difficulties  as  follows: 

"No  well-defined  set  of  rules  is  known  by  which  the  individual  words  in  the 
language  are  combined  into  meaningful  word  groups  or  sentences.  Specifically, 
the  correct  identification  of  the  meaning  of  word  groups  depends  at  least  in  part 
on  the  proper  recognition  of  syntactic  and  semantic  ambiguities,  on  the  correct 
interpretation of homographs, on the recognition of semantic equivalences, on
the detection of word relations, and on a general awareness of the background
and environment of a given utterance." 3/


1/ Giuliano, 1963 [230], p. 6.
2/ Stone, 1962 [576], p. 1.
3/ Salton, 1963 [519], p. 1-2.

Similarly, Baxendale states:


"We  are  confronted  with  difficulties  which  arise  from  the  multiple  ways  in 
which  words  and  sentences  are  put  together  to  convey  meanings  and  shades 
of meaning -- i.e., to represent ideas and concepts. Research into this
problem -- drawing upon psychological and logical analysis -- is scarcely
begun." 1/

A  third  core  problem  is  the  proper  choice  of  appropriate  selection  criteria  if 
condensed  representations  of  document  content  must  be  used  for  scanning,   search,  and 
relevance  decisions.    Swanson  suggests  that  the  price  paid  for  brevity  of  representation 
so  that  searching  operations  can  be  efficiently  managed  is  the  loss  of  at  least  some, 
perhaps  most,  of  the  information  in  a  collection  or  library.    He  notes  also  that: 

"It  is  another  obvious  but  seldom  remarked  fact  that  the  extent  of  such 
information loss for existing libraries is not only unknown but has never
been defined in measurable terms." 2/

This  loss  is  lived  with,  today,  in  many  practical  situations  involving  abstracts,  index 
term  sets,   selective-dissemination  notices,  and  even  mere  author-title  listings  in 
announcement  bulletins  or  search  output  products  from  either  manual  or  machine 
searches.    Yet  the  sheer  increase  in  volume  of  the  total  number  of  items  to  be  covered 
and  of  the  number  of  items  potentially  responsive  even  to  a  single  individual's  interests 
has  severely  stretched  any  individual's  capacity  to  scan  or  skim,  much  less  read,  the 
presumably  pertinent  material  --  documents  themselves,  abstracts  of  other  documents, 
listings  of  documents  available  --  already  accumulating  on  his  desk. 

Condensation, reductive representation, becomes more and more imperative.
Concurrently, while conventional tools may be lived with, after a fashion, the
substitution of machine-compiled or machine-produced alternatives may make the problem
of how adequate the user judges the selection and condensation to be that much worse.
Even when such alternatives give the same information in the same volume (number of
pages to be scanned), they may suffer from such things as inferiorities of page and line
formatting, size of type on the page, and limitation of typography to upper case and a
few other symbols.

1/ Baxendale, 1962 [42], p. 68.
2/ Swanson, 1960 [582], pp. 5-6.

A fourth problem in evaluation, therefore, is the question of whether or not the
benefit to users is worth the cost. For example, despite the arguments for concept
rather than word indexing, for assignment of labels rather than mere extraction of a few
words used by the author himself, at least some data on the use made by scientists of
various sources of information on material which might be of interest to them suggests
that subject indexes are not the most important source, nor even a major source. Herner
found, for example, that only about 16 percent of his respondents reported use of indexes
and abstracts as primary tools in literature searches. He reports, for the use of tools in
becoming aware of current sources of information, 477 of 3,832 responses indicating the
use of indexing and abstracting publications as against 486 using footnotes or other cited
references, 1/ 291 using library acquisition lists, and 212 using separate bibliographies
(Herner, 1958 [265]).

These data, and similar findings of Fishenden that 17 percent of scientists queried
considered the scanning of titles in accession lists and announcement bulletins a principal
means to find information of interest, 2/ suggest that KWIC type indexes may be adequate
for many purposes. On the other hand, the KWIC index to the U.S. Government Research
Reports made available to the public on an experimental basis through the Office of
Technical Services was discontinued after a year of subsidized operation because too few
of the users indicated willingness to pay a fee in order to have the service continued on
a subscription basis.

The  evaluational  problem  here  involves  the  lack  of  information  on  indexing  costs, 
the  relatively  few  quantitative  and  objectively  validated  studies  that  have  been  made  of 
user  needs,  the  question  of  whether  what  the  user  says  he  does  or  wants  is  what  he  really 
wants  or  does,  and  the  matter  of  defining  "interest"  for  different  users  with  differing 
purposes  and  requirements.    The  concept  of  "interest"  is  taken  to  mean  the  motivations 
of  a  particular  user  or  group  of  users  at  a  particular  time,  while  the  equally  imprecise 
notion  of  "relevance"  refers  to  the  value  judgments  made  by  the  user  as  to  the  relation 
of  an  item  to  his  query  or  interest. 

A final core problem, then, is the question of relevancy itself, involving
recognition that "relevancy is a comparative rather than a qualitative concept ... (and)
... that a document of little relevancy in the eyes of X might well be highly relevant
in the eyes of Y." 3/ Mooers states, similarly, that:

"There  is  no  absolute  'Relevance'  of  a  document.    It  depends  upon  the  person 
and  his  background,  the  work  and  the  date.    What  is  not  relevant  today  may  be 
relevant tomorrow." 4/

Good discusses various possible measures of 'relevance' (logical measures, frequency
measures, references to, citations of, interest measures, linguistic measures), 5/
but except for the obvious statistical criteria, the problems of how to measure relevancy
remain largely unresolved.

1/ Note that Herner's data and those of Glass and Norwood, 195H [232], reporting
6.9 percent use of cross-citations in another paper as the method of learning
of important work as against 1.2 percent using an indexing service, appear to
reinforce the claims of those who advocate citation indexing.
2/ Fishenden, 1958 [197], p. 163.
3/ Bar-Hillel, 1959 [33], p. 4-8.4.
4/ Mooers, 1963 [423], p. 2.
5/ Good, 1958 [234], pp. 7-9.

At least some data on the variability of relevance judgments is available in reports
of the performance of an SDI (Selective Dissemination of Information) system. In such
systems, the indexing terms or tags assigned to a new item are compared with a file of
"user-profiles," that is, with a pre-prepared listing of terms or topics in which a
particular user is interested. Where the term-profile of a new item matches that of a
user, a notification of the acquisition of that item is sent to him. Barnes and Resnick
report tests of such a system in which pseudo-notifications selected randomly were
included with those produced from the matching procedure. Account was kept of which
notices were regarded by the users as meeting their interests and which were not. They
found that 58.1 percent of the non-random notifications were regarded as relevant, but
that so also were 26.8 percent of the random ones. 1/
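
The matching step described above can be illustrated by a minimal sketch. The source
does not state the matching rule used by Barnes and Resnick; the rule assumed here
(notify a user when the item shares at least a given number of terms with his profile),
the function names, and the sample profiles are all hypothetical. The last lines simply
set the two reported percentages side by side.

```python
# Hypothetical sketch of SDI profile matching; the overlap rule and all
# names below are assumptions, not details given by Barnes and Resnick.

def shared_terms(item_terms, profile_terms):
    """Return the index terms common to a new item and a user profile."""
    return set(item_terms) & set(profile_terms)


def users_to_notify(item_terms, profiles, min_shared=1):
    """List users whose profile shares at least `min_shared` terms with the item."""
    return sorted(
        user
        for user, profile_terms in profiles.items()
        if len(shared_terms(item_terms, profile_terms)) >= min_shared
    )


# Invented profiles and item, for illustration only.
profiles = {
    "user_a": {"indexing", "retrieval", "thesauri"},
    "user_b": {"toxicity", "penicillin"},
}
new_item = {"automatic", "indexing", "evaluation"}
notified = users_to_notify(new_item, profiles)

# Figures reported in the text: share of notices judged relevant.
matched_relevant = 58.1   # percent, notices produced by matching
random_relevant = 26.8    # percent, randomly selected pseudo-notices
improvement = matched_relevant / random_relevant
```

On the reported figures, matched notices were judged relevant a little more than twice
as often as random ones, which is the comparison that makes the random baseline
informative.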

Katter  comments  on  findings  that  the  intersubjective  agreement  of  typical  users 
with  respect  to  value  judgments  of  condensed  representations  of  text  is  low.  He 
suggests: 

"One  source  of  this  low  intersubjective  agreement  among  users  may  be  that  it  is 
often not clear what is intended by the words relevant and representative.
Considerations such as the validity of the material, its usefulness, stylistic qualities,
understandability, conceptual preferability, etc., can all enter their judgments in
unknown amounts." 2/

Corroborating  evidence  is  available  from  other  sources.    Swanson,  in  his  tests  of 
a  natural  language  text  searching  technique,  had  first  used  subject  matter  specialists  to 
rate  the  relevance  of  each  of  the  text  documents  to  each  of  50  questions.    Two  individuals 
rated  each  item,  and  if  they  disagreed  significantly,  a  third  person  was  asked  to  reconcile 
the  difference.      In  spite  of  this,  8  percent  of  the  cases  of  failure  to  retrieve  "relevant" 
documents  were  ascribed  to  incorrect  initial  judgments  of  relevance,  and  15  percent  of  the 
presumably  "irrelevant"  documents  were  finally  judged  to  be  relevant  after  all  (Swanson, 
1961 [586]). In Swanson's words: "The question of formulating criteria for judging the
relevance of any document to the motive, purpose, or intent which underlies a request for
information is profound and lies at the heart of the matter." 3/


1/ Barnes and Resnick, 1963 [36], p. 2.
2/ Katter, 1963 [308], p. 24.
3/ Swanson, 1960 [587], p. 1099.

7.2     Bases and Criteria for Evaluation of Automatic Indexing Procedures

What should the bases be for the evaluation of existing or proposed indexing systems
that rely, to a greater or lesser extent, on machine generation of the indexing or
classificatory labels? Since the evaluation of quality of indexing per se raises such
fundamental and elusive questions, can these questions be begged for the case of automatic
indexing as they are in fact for almost all manual systems? If so, the obvious bases are
those of time, cost, availability of alternative possibilities, and customer acceptance.
Here again we are faced with a dearth of objective data, even for the intercomparison of
any two manual systems.

In the two years preceding the ICSI Conference, the Program Committee openly
solicited papers that would provide comparative data for operating information systems
and that would develop and discuss criteria for the comparison of systems. 1/
Nevertheless, of the papers received only two were responsive to this invitation: the
special case of comparing the conventional file against the inverted file approach to the
searching of chemical structure data (Miller et al, 1959 [419]), and an early report by
Cleverdon on the ASLIB Cranfield project for the intercomparison of indexing systems,
under a grant from the National Science Foundation (1959 [126]).

There had been an earlier comparative experiment, generally conceded to be the
first of its kind, 2/ in which 98 search requests were run by ASTIA personnel using a
conventional catalog and by personnel of Documentation, Inc., using a coordinated Uniterm
index. Warheit says:

"Unfortunately,  the  conditions  of  the  test  were  very  poorly  designed  so  that, 
in  the  final  analysis,  each  group  was  the  sole  judge  both  of  the  scope  of  the 
original  request  and  of  the  adequacy  of  the  bibliographies  produced.  The 
resulting claims are of course contradictory." 3/

1/ See "Proposed Scope of Area 4," Proceedings, ICSI, 1959 [481], pp. 665-669.
2/ Compare, for example, Gull, 1956 [246], p. 329: "When one considers that a
fairly thorough search of the literature indicates that this comparison of two
reference systems is the first undertaken so far, it is not surprising that the
results reveal clerical errors and an incomplete design of the test."
3/ Warheit, 1956 [631], p. 274.

However, some of the findings are pertinent to our present questions of evaluation.
Thus, of 492 items selected by Documentation, Inc., that ASTIA considered pertinent but
had not selected, 98 were missed by them although the proper subject heading was
searched and the catalog card had adequate selection clues, 89 were missed because not
all applicable subject headings were searched, 21 were missed because the original
subject heading assignments had been inadequate, 7 were missed because neither title nor
abstract provided indication that the report itself was pertinent to the request, and 102
were missed "because the subject heading did not occur to the searcher or because there
were so many cards under the subject heading that the searcher was discouraged". 1/
Similarly, Gull reports, of 318 items selected by ASTIA that Documentation, Inc.
personnel considered relevant but had not themselves selected, 97 were missed because
the searcher did not consult the proper terms.

7.2.1     The  Cranfield  Project 

The inauguration of the Cranfield project is itself indicative of a prior lack of
objective standards as applied to the measurement of effectiveness of information
indexing, selection and retrieval systems. 2/ Beginning in 1957, and still continuing with
respect to individual indexing devices such as synonym controls and role indicators, this
work has attempted to compare different indexing systems (e.g., UDC, Uniterm, etc.)
under different indexing conditions (e.g., type of training of indexer, length of time
allowed to index) against proposed measures of "retrieval effectiveness". These
measures are the recall ratio, or the percentage of relevant documents
retrieved as against the total number of relevant documents known to be in the collection,
and the relevance ratio, or the percentage of relevant documents among those actually
retrieved.
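
The two ratios just defined can be written out as a short sketch. The function names
and the sample document numbers are hypothetical; only the definitions themselves are
taken from the text.

```python
def recall_ratio(retrieved, relevant):
    """Percentage of the known relevant documents that were retrieved."""
    relevant = set(relevant)
    if not relevant:
        raise ValueError("no relevant documents are known for this question")
    return 100.0 * len(set(retrieved) & relevant) / len(relevant)


def relevance_ratio(retrieved, relevant):
    """Percentage of the retrieved documents that are relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        raise ValueError("no documents were retrieved for this question")
    return 100.0 * len(retrieved & set(relevant)) / len(retrieved)


# Invented example: four relevant documents, six retrieved, three in common.
relevant_docs = {1, 2, 3, 4}
retrieved_docs = {2, 3, 4, 7, 8, 9}
recall = recall_ratio(retrieved_docs, relevant_docs)        # 75.0
relevance = relevance_ratio(retrieved_docs, relevant_docs)  # 50.0
```

In this invented case three of the four relevant documents are found, at the price of
three irrelevant ones among the six retrieved.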

In the first Cranfield tests, on 18,000 documents, it is reported that the recall ratio
ranged between 75 and 85 percent for all four indexing systems. 3/

1/ Gull, 1956 [246], p. 329.
2/ Compare, for example, Randall, 1962 [492], pp. 380-381: "Prior to 1957, the
proponents of the various indexing and classification schemes, the universal
decimal system, the alphabetic subject heading, the Uniterm system and faceted
classification touted their own system on the bases of subjective evaluation and
theoretical investigations. There were many claims and much supposition about
the relative merits and benefits ... but there was no body of data from which an
objective evaluation could be made... Many observers believe that the Cranfield
study constitutes the most important work done in the field of cataloging in
recent times."
3/ Cleverdon, et al, 1964 [130], p. 87.

These results are rather better than reported by others 1/ and have been subjected to
specific criticisms, although these first tests were limited to the recall of the source
documents on which the test questions were based. For non-source documents there would of
course also be questions relating to the core problem of how relevance is to be judged.
Thus Markus says:

"Despite  investigations  by  Cleverdon  in  England,  and  by  many  others,  there  is 
today  no  generally  accepted  method  of  comparing  the  effectiveness  of  different 
types  of  indexes.  The  needs  of  index  users  vary  so  greatly  that  even  the  most 
carefully planned tests of retrieval efficiency can be challenged." 2/

Notwithstanding  such  criticisms,  however,  and  in  spite  of  the  fact  that  the  Cranfield 
tests  have  so  far  been  directed  principally  to  indexing  systems  applied  manually,  certain 
findings  and  conclusions  reached  by  Cleverdon  and  his  associates  are  pertinent  to  the 
questions  of  evaluating  automatic  indexing  procedures.    Examples  are: 

"The  fact  is  that  no  indexing  sleight  of  hand,  no  indexing  skill,   can  produce  a 
system  in  which  a  figure  for  recall  can  be  improved  substantially  without 
weakening the over-all relevance, i.e., the number of documents that are
really relevant compared with the total number retrieved.

"The majority of the failures (60 percent) were due to inadequacies and
inaccuracies (carelessness rather than lack of knowledge) in the indexing process.
However, supplementary tests, in which the staff of outside organizations carried
out the indexing revealed that the Cranfield indexers were achieving a standard
above average. This seems to indicate a certain inevitability of human weakness
and error in the indexing process and lends some support to the many current
research projects that are investigating the feasibility of automatic indexing." 3/

1/ See, for example, Johnson, 1962 [300], p. 90: "The amount of meaningful
information that can be retrieved is too small. There are few available studies
on this subject. But these seem to indicate that, under some indexing schemes,
meaningful retrieval can run as low as 10 and 15 percent and that the most that
can be optimized for any of them, even under highly motivated conditions, is
around 70 percent."
2/ Markus, 1963 [394], p. 16. See also Kochen, 1963 [327], p. 12: "The outstanding
large-scale and realistic experimental work is that of Cleverdon.
Unfortunately, his results are not very decisive."
3/ Cleverdon et al, 1964 [130], pp. 86-87.

7.2.2    O'Connor's Investigations

As O'Connor has cogently observed on a number of occasions, the question of
whether or not automatic indexing is possible is not the real question. Rather, the
problem is whether or not indexing by machine is capable of producing results that are
"good enough" for retrieval purposes, raising in its turn the still more basic question of
how "good retrieval" can be evaluated. His own approach in detailed investigations has
been to study an existing system (e.g., using Merck, Sharp and Dohme data) with respect
to indexing terms such as "penicillin," "toxicity," and "mode of action." He then
attempts to define various possible machine assignment rules, and then to determine the
probable over- and under-assignments that would result from the application of these
rules.

Typical results pertinent to both questions of word-indexing evaluation and of
inter-indexer consistency showed that for 23 documents indexed under the term "toxicity," 11
did not contain the stem "toxi..." at all; that 17 items indexed under "penicillin" contained
the word at least once; that none of 34 randomly selected documents not indexed under
"penicillin" contained the word, but that 7 of 28 items not so indexed but selected as
probable candidates from title and other clues did contain the word (O'Connor, 1961
[447]).
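
A comparison of this kind can be sketched as follows. The sketch assumes the simplest
possible machine rule (assign the term whenever the stem occurs anywhere in the text);
that rule, the function names, and the four sample documents are hypothetical and are
not O'Connor's own rules or data. Only the figure of 11 out of 23 comes from the text.

```python
def stem_rule(text, stem):
    """Hypothetical assignment rule: assign the term if the stem occurs in the text."""
    return stem.lower() in text.lower()


def assignment_errors(documents, stem):
    """Count under- and over-assignments of the rule against human indexing.

    `documents` is a list of (text, human_assigned) pairs.
    """
    under = sum(1 for text, human in documents if human and not stem_rule(text, stem))
    over = sum(1 for text, human in documents if not human and stem_rule(text, stem))
    return under, over


# Invented miniature collection, for illustration only.
documents = [
    ("Acute toxicity of the compound in mice", True),
    ("Lethal dose and adverse effects in rats", True),    # indexed, stem absent
    ("Synthesis of a new antibiotic", False),
    ("A nontoxic solvent for extraction", False),         # not indexed, stem present
]
under, over = assignment_errors(documents, "toxi")

# Rate implied by the figures reported in the text: 11 of 23 documents
# indexed under "toxicity" lacked the stem.
reported_under_rate = 11 / 23
```

Under this simple rule, the reported figures imply that nearly half of the documents a
human indexer placed under "toxicity" would have been missed.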

Typical  suggestions,  comments,  and  conclusions  made  by  O'Connor  include  the 
following: 

"It  might  be  required  that  the  mechanized  indexing  permit  as  good  (or  no  worse) 
retrieval  as  existing  human  indexing,  because  it  is  desired  to  free  the  subject- 
skilled  indexing  personnel  for  other  work.    Or  poorer  retrieval  (than  possible 
with  human  indexing  such  as  is  presently  done  of  comparable  material)  might  be 
accepted  from  computer  indexing,  because  poorer  retrieval  is  better  than  none 
and there is a shortage of subject-skilled people to do the additional indexing." 1/

"Such  considerations  as  the  following  are  relevant.    Over-assigning  can  increase 
input  costs  and  storage  (to  an  extent  dependent  on  the  storage  system),  but 
mechanizing  indexing  might  be  worth  the  cost.    Over-assigning  might  also 
increase  the  number  of  irrelevant  documents  retrieved,  but  the  increase  might 
be insignificant." 2/

"...Suppose terms A, B, and C each correctly characterize five percent of a
ten thousand document collection, each term is overassigned to another five
percent, and over-assignment of each term occurs independently of the correct
assigning and over-assigning of the others. Then about nine documents will be
extra for the search question A & B & C." 3/
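
The figure of "about nine" extra documents can be checked with a few lines of
arithmetic. The sketch below rests on one added assumption not spelled out in the
quotation, namely that the correct assignments of the three terms are also independent
of one another; the function name is hypothetical.

```python
def expected_extra_documents(collection_size, p_correct, p_over, n_terms):
    """Expected number of extra documents retrieved by an AND of n_terms terms.

    Assumes every term is assigned (correctly or by over-assignment)
    independently of the others.
    """
    p_assigned = p_correct + p_over                            # 0.10 per term
    retrieved = collection_size * p_assigned ** n_terms        # about 10 documents
    truly_relevant = collection_size * p_correct ** n_terms    # 1.25 documents
    return retrieved - truly_relevant


extra = expected_extra_documents(10_000, 0.05, 0.05, 3)
```

Under that assumption about ten documents are retrieved, of which about one and a
quarter correctly carry all three terms, leaving 8.75, or about nine, extra.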

"The  question  of  permitting  some  under-assigning,  that  is,  the  computer  failing 
to assign [a term] T to some document which should have it, is more delicate.
Human indexers sometimes underassign. If we knew the rate of underassigning
by human indexers for a term T, we might consider allowing the computer a
similar rate. However, some cases of underassigning might be more important
than others and if the computer made more important mistakes than the human
indexers, retrieval might not be 'good enough'." 4/

1/ O'Connor, 1960 [444], p. 3.
2/ O'Connor, 1961 [448], p. 199.
3/ O'Connor, 1960 [444], p. 6.
4/ Ibid, pp. 6-7.

Other typical points made by O'Connor include the possibilities that the use of
automatic indexing techniques might free trained technical people for other work, that it
might permit more indexing than is now possible with available resources, that it might
cost less, and that it might produce a better or more consistent indexing product. 1/
With respect to the latter point, however, he points out that greater consistency might not
in itself be a virtue, since the product although generated more consistently might be
relatively worthless by comparison with the inconsistent human product. 2/ Especially
pertinent to the question of judgment factors in evaluation was a comparison of the most
frequent words selected by the Luhn "auto-encoding" technique as applied to an ICSI paper
against a quasi-random word list for the same paper produced by selecting the last
non-common word on every page, and the first such word on every second page. He remarks:

"The  important  point  of  this  quasi-random  list  for  my  present  purposes  is  to 
emphasize  that  first  impressions  might  not  be  at  all  a  good  way  of  judging 
the adequacy of an index set." 3/

7.2.3    Questions of Comparative Costs

The paucity of objective data on the effectiveness of indexing systems generally
extends to even such obvious questions as costs of indexing and time required to index.
These very questions might, in fact, be decisive with respect to choice between manual
and machine systems. It has been estimated by some that the costs of manual subject
indexing amount to close to 75 percent of the costs of operating an information selection
and retrieval system, 4/ yet very little actual data on costs has been reported in the
literature. 5/ Exceptions are, for the most part, limited to rather special cases, such
as the following examples:

1/ O'Connor, 1962 [447], p. 267.
2/ O'Connor, 1963 [443], p. 16.
3/ O'Connor, 1962 [447], p. 270.
4/ O'Connor, 1963 [442], p. 1.
5/ See, for example, A. D. Little, Inc. (1963 [23], p. 5): "Performance and cost data
on existing large documentation systems are surprisingly sparse, and cost data
have rarely included adequate overhead and depreciation accounting."

1.  A total cost of less than $30,000 is reported for a 10,000 document
collection at Aeronutronic. Four man-years of effort were required.
On average, 12.6 access points were provided per document, of which
9.2 were subject-indicating descriptors chosen, with some modifications,
from the second edition of the ASTIA Thesaurus. "This favorable figure
was possible because an adequate ready-made thesaurus of indexing terms
was available and because the 'peek-a-boo' type equipment used was much
less expensive than most other devices offering comparable speed
of operation and search logic possibilities." 1/


2.  "The experience of libraries that have gone through indexing using
links and role indicators and careful editing shows that indexing takes
about one-half hour per document (or $4.00) and costs an additional
$1.00 for routine processing." 2/

3.  In an investigation of the comparative merits of manual indexing of 2,000
documents using the UDC classification system as against a KWIC index,
Black gives the figure of approximately $1400 for the UDC case compared
to about $600 for an in-house computer operation to produce KWIC listings,
and somewhat more for a KWIC index compiled by a service bureau. 3/

Time  required  to  index,  which  directly  involves  cost,  is  reported  by  Cleverdon  to 
vary  widely: 

"Few  reliable  figures  have  been  given  for  current  practices,  although  a  particularly 
high figure is the 1 1/2 hours average quoted for indexing reports for the catalogue
of aerodynamic data prepared by the Nationaal Luchtvaartlaboratorium in
Holland. It appears from personal discussions that an average of 20 minutes for
a general collection of technical reports is the top limit, and this has been taken
as the maximum indexing time to be used in the project." 4/

Insofar as such meagre data is indicative, there does not appear to be any particular
cost-advantage for machine-compiled and machine-generated indexing other than the
title-only KWIC indexes. Thus, Olmer and Rich report, in part:

"The  program  .  .  .  lends  itself  to  a  variety  of  applications.    One  of  these  ...  is 
estimated to cost roughly $4.00 per document for cataloguing, putting on tape,
printing and making any necessary corrections." 5/

This  is  for  a  case  where  the  indexing  (cataloging)  is  done  manually. 

1/ Linder, 1963 [361], p. 147.
2/ Lockheed Aircraft Corp., 1959 [369], p. 93.
3/ Black, 1962 [65], p. 318.
4/ Cleverdon, 1959 [126], p. 690.
5/ Olmer and Rich, 1963 [454], p. 182.

For a specific proposed automatic indexing system, employing a modified version
of the Luhn word-frequency counting selection principle, Gallagher and Toomey report
that:

"For the documents in our system, we estimate that processing time will be
about 20 seconds per thousand words ... The cost is approximately $3.50
per minute when averaged between prime and extra shift." 1/

This means that the cost of processing a 3,000-word document would be $3.50, exclusive
of the costs of keypunching the input text which, conservatively estimated, costs not less
than 1-2 cents per word. 2/ Swanson similarly assumes either that machine-usable text
is already available or that editing and keystroking efforts are separate costs in arriving
at an estimate of $1.00 per item for automatic indexing. 3/
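
The arithmetic behind these figures can be set out in a short sketch, using only the
rates quoted above (20 seconds per thousand words, $3.50 per minute, one to two cents
per word for keypunching). The function names are hypothetical.

```python
def processing_cost(words, seconds_per_thousand_words=20.0, dollars_per_minute=3.50):
    """Machine processing cost in dollars at the quoted Gallagher and Toomey rates."""
    seconds = words / 1000 * seconds_per_thousand_words
    return seconds / 60 * dollars_per_minute


def keypunch_cost(words, cents_per_word):
    """Cost in dollars of keypunching the input text."""
    return words * cents_per_word / 100


document_words = 3000
machine = processing_cost(document_words)        # 60 seconds of machine time
keypunch_low = keypunch_cost(document_words, 1)  # at one cent per word
keypunch_high = keypunch_cost(document_words, 2) # at two cents per word
```

For a 3,000-word document the machine time comes to $3.50, while keypunching at the
quoted rates comes to between $30 and $60, roughly ten times the processing cost or
more.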

These quantitative estimates bear out the more subjective conclusions of such
investigators as Bar-Hillel, O'Connor, and others. Examples are:

"It  is  very  likely  that  manual  Uniterm  indexing  by  cheap  clerical  labor  will  still, 
on  the  average,  be  qualitatively  superior  to  any  kind  of  automatic  indexing,  and 
it  is  very  unlikely  that  the  cost  of  automatic  indexing  will  ever  be  less  than  this 
kind  of  manual  Uniterm  indexing,  unless  the  automatic  indexing  is  to  be  of  such 
low quality as to totally defeat its purpose." 4/


"Most of these techniques require that the full texts of documents be in machine
readable form. At present this usually requires keypunching which is much
more expensive than a specialist's indexing efforts." 5/


1/  Gallagher and Toomey, 1963 [205], p. 52.

2/  Compare, for example, Ray, 1961 [496], p. 55; Swanson, 1962 [584], p. 470:
    "The cost is roughly one or two cents per word which by standards of what is
    normally spent for even the most thorough indexing and cataloging, is
    exorbitant." Mersel and Smith report (1964 [415], p. 10A) typical TRW costs
    of keypunching as two cents per word for Russian technical text, and one cent
    per word for English. They also cite cost figures as low as half a cent per
    word at the CIA-Georgetown Keypunching Center in Frankfurt and at IBM, but
    this is exclusive of overhead and computer processing (e.g., editing program)
    costs, so that the one cent figure appears minimal as of today. However,
    Kochen reports (1963 [327], p. 7): "While keypunching of text cost roughly one
    cent/word, new means for recording spoken (and written) text using a steno-
    keyboard tied to a photodisc storing a Stenocode-English dictionary could possibly
    reduce the cost to 1/3-cent per word."

3/  Swanson, 1962 [584], p. 471.

4/  Bar-Hillel, 1962 [35], p. 418.

5/  O'Connor, 1963 [443], p. 1.

7.2.4    Summary:    Potential  Advantages  as  Bases  for  Evaluation

In view of the difficulties engendered by the underlying core problems, the
criticisms that can be brought against tests of "retrieval effectiveness", and the general
lack of comparative data and standards of measurement, the question of evaluation of
automatic indexing procedures largely reduces to the weighing of potential advantages and
disadvantages. In the case of such procedures as KWIC and citation indexing, some of these
possibilities, both pro and con, have been discussed previously. In general, suggested
bases for evaluation reflecting operational considerations may be summarized as follows:

1. Speed and timeliness

2. Relative economy

3. Consistency and reliability 1/

4. Elimination of the need for further human intellectual effort after
initial planning and programming has been done.

5. Providing a product that could not otherwise be obtained.

6. Ease of updating and revision of indexes so produced. 2/

From  the  point  of  view  of  possible  operational  advantages,  these  may  be  combined 
into  the  single  criterion: 

The  achievement  of  a  more  effective  and  more  economical  balance  between  the 
meeting  of  the  objectives  of  the  indexing  system  and  the  utilization  of  available 
resources. 


1/  Compare McCormick, 1962 [409], p. 182: "A computer is objective in its operations
    and it can be repetitive. If given a certain amount of information about a document,
    it is always able to index the document in a consistent manner. This consistency is
    desired so as to avoid the situations where a person might index a document
    differently on various occasions, or where it would be indexed differently by another
    person when there appears to be no good reason for a difference." Note, however,
    O'Connor's point previously mentioned (1963 [443], p. 16): "It has been argued
    that mechanized indexing has the advantage of consistency . . . However this
    argument by itself says very little in favor of mechanized indexing. For two humanly
    produced index sets for a document which differ somewhat may both be quite useful,
    though imperfect, while the index set which the same program will always reproduce
    for the same document may be worthless."

2/  See, for example, Youden, 1963 [658], p. 332:
    "The facility with which indexes may be updated and the ease of selecting items for
    special bibliographies will result in the majority of indexes being computer produced
    before many years."

However, the question of the objectives of the system brings us back full circle to the
questions  of  purpose  in  terms  of  particular  requirements,  of  quality,  and  of  how  to 
measure  either  purpose  or  quality.    Thus  we  may  determine  that  an  automatic  indexing 
procedure  produces  a  product  at  least  as  rapidly,  at  least  as  inexpensively,  at  least  as 
consistently  as  human  indexing  operations  would,  and  with  substantially  less  investment  of 
manpower  resources.    However,  will  this  product  be  as  useful  or  as  "good"  as  the  human 
product? 

In view of the many caveats about the present quality of indexing systems 1/ and the
lack of standards for measuring quality, 2/ it is important to recognize that we should
compare the products of automatic indexing methods "not with hand-crafted excellence, but
with the average, the routine output of the over-burdened subject analyst working with the
deficiencies of any other indexing system". 3/ Such deficiencies include the critical
question of how well and how consistently the system, whatever it is, is applied in practice
by the human analysts.

7.3      Findings  with  Respect  to  Inter-Indexer  and  Intra-Indexer  Consistency

Very  few  objective  studies,  despite  the  obvious  relationship  to  the  general  questions 
of  quality,  pertinency,  and  reliability  of  indexing,  have  as  yet  been  made  of  inter-indexer 
and  intra-indexer  consistency.    Perhaps  the  first  investigation  both  to  obtain  experimental 
data  and  to  analyze  the  observed  types  of  failures  to  achieve  correct  assignments  was  that 
of  Lilley.  4/  He  took  the  answers  made  to  6  questions  by  340  students  entering  a  graduate 
library  school,  wherein  they  were  asked  to  write  down  the  subject  headings  which  they 
would  expect  to  be  applied  to  other  books  on  the  same  subject  as  6  "sample  books"  in  a 
system  such  as  the  Library  of  Congress  card  catalog.    Lilley  reports: 


1/  See, for example, in addition to comments by O'Connor and others previously quoted,
    Helyar, 1961 [262], p. 110: "The general current of feeling of the meeting as
    reflected both in the papers and in the discussion is that the standard of indexing is not
    nearly adequate;" Artandi, 1963 [22], p. 1: ". . . 'Good indexing' as such has
    not been defined satisfactorily and is the function of many variables, some known,
    others not yet identified"; Tritschler, 1963 [610], p. 5: ". . . 'Good' indexing is
    extremely difficult to describe and 'perfect' indexing is impossible to define or
    measure."

2/  See Cleverdon, 1960 [124], p. 429: "The most important requirement in information
    retrieval is a recognized standard of measurement and after that we need a
    satisfactory method of measuring. Only when these have been found will it be possible to
    know for certain whether any new system of indexing or retrieving information is an
    improvement on previous methods. At present all those trying to solve the problems
    of information retrieval are working very much in the dark, uncertain as to the real
    problems and quite unable to apply any measurements to their proposed solutions."

3/  Kennedy, 1962 [311], p. 126.

4/  Lilley, 1954 [360]. See also Vickery, 1960 [626], p. 4.

"A total of 2245 headings were suggested, averaging 1.1004 headings per book per
student. These headings represented 373 different varieties, of which 368 were
different from the headings traced on the Library of Congress cards for the sample
books . . . As an average 62.17 different headings were suggested for each book . . .

"When the 368 different varieties of incorrect headings were analyzed in accordance
with certain criteria that had been set up, it was found that incorrect specificity was
a factor in 93.48%, incorrect terminology in 79.08% and incorrect form of entry in
72.28% of the headings . . . Over half of the incorrect headings (54.62%) had some
combination of two errors, and almost half (49.75%) could have been converted into
'correct' headings only by changing the level of specificity, and by revising the
terminology, and by altering the form . . .

"It was also found, contrary to the general assumption that failure in specificity
almost always means that the reader is approaching his subject from too broad a
point of view, that of those headings in which an incorrect level of specificity was a
factor . . . 64.82% were too broad and 35.18% were too narrow." 1/

Lilley  then  asks  the  rather  plaintive  question  as  to  what  would  happen,  given  that  his  quite 
homogeneous  group  of  subjects,  all  of  them  college  graduates  and  all  seriously  interested 
in  librarianship,  could  come  up  with  more  than  62  different  headings,  on  average,  for 
every  heading  actually  used  in  the  catalog,  if  his  test  group  had  included  a  larger  number 
of  subjects  with  more  heterogeneous  interests? 

In  1961,  Macmillan  and  Welt  investigated  the  duplicate  indexing  of  171  papers  in  a 
limited  area  of  the  medical  sciences  (1961  [389]).    In  only  18  percent  of  the  cases  was  the 
indexing  identical  or  nearly  so.    About  a  third  of  the  papers  had  been  indexed  so  differently 
that  there  was  no  common  correlation.    For  the  rest,  terms  were  used  in  one  case  that 
were  missed  in  the  other. 

Some brief data on inter-indexer consistency are also provided by Kyle (1962 [342])
for two indexers applying her classification system to 246 arbitrarily selected French and
English items in the field of political science. Of these, 160 were indexed the same way
by both indexers, for a consistency figure of 70 percent. Tritschler noted that no items
were indexed the same way a second time as they were the first, in small-scale
experiments involving 20 documents independently indexed by 7 different people. 2/

Painter (1963 [460]), in her study of problems of duplication and consistency of
subject indexing of the reports handled by the Office of Technical Services, proceeded by
selecting items from the announcement bulletins of agencies contributing to OTS, having
these items re-indexed in the various agencies, and comparing the results with the
original indexing assignments. At ASTIA, 94 items were re-indexed, with 1,239 terms having
been assigned to them originally and 1,119 assigned on the re-run. Overall, 62 percent of
those terms originally assigned were also assigned the second time, and 69 percent of the
second-time terms had also been assigned originally. However, 111 of the starred
descriptors (which are of the most significance in the ASTIA system) were used the first time
and not the second, while 98 were used the second time but not the first.
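
The two ASTIA percentages differ only because the two runs assigned different numbers of terms; the number of terms common to both runs is the same whichever run serves as the base. The relationship is sketched below using only the ASTIA figures quoted above. The function name is hypothetical, and the count of common terms is inferred from the rounded 62 percent figure, so it is approximate.

```python
def overlap_from_first_run(n_first: int, n_second: int, share_of_first: float):
    """Return (terms in common, share of second-run terms also assigned
    the first time), given the size of each run and the share of
    first-run terms that were repeated on the second run."""
    common = round(n_first * share_of_first)
    return common, common / n_second


# ASTIA: 1,239 terms originally, 1,119 on the re-run, 62 percent repeated.
common, share_second = overlap_from_first_run(1239, 1119, 0.62)
print(common)                     # approximate number of terms in common
print(round(share_second * 100))  # consistency relative to the second run
```

Under these assumptions about 768 terms were common to both runs, which reproduces the 69 percent figure reported relative to the second run.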


1/  Lilley, 1954 [360], pp. 42 and 43.
2/  Tritschler, 1963 [610], p. 5.

At AEC, 96 items were re-indexed to the subject heading scheme used in Nuclear
Science Abstracts. There had been 249 headings assigned to these items originally and
406 were assigned on the second run, for an overall consistency rate of 54 percent, but
with 53 percent of the headings used the second time not having been used the first. The
sample checked at OTS consisted of 32 items to which 346 descriptors had been assigned
the first time and 415 the second. The consistency was 65 percent with respect to the first
run and 54 percent with respect to the second. Finally, at the National Agriculture Library,
99 items were checked, with results showing a high consistency rating and a similarity of
indexing between the two runs of 86 percent. Painter concludes:

"The  consistency  rates  are  not  encouraging.    Apparently  there  is  little  difference 
between  preparation  for  a  manual  system  and  that  for  a  machine  system.    The  per- 
centages indicate  that  there  is  no  significant  difference  between  consistency  where 
two  or  three  headings  are  assigned  and  where  twelve  or  sixteen  are  assigned. 
Therefore,  we  are  left  with  the  fact  that  regardless  of  these  variables,  consistency 
rates  range  between  60  and  72  per  cent.  "  1/ 

Jacoby and Slamecka report even less encouraging data (1962 [293]). "In general,
the inter-indexer reliability was found to be low (in the vicinity of 20 per cent), the
intra-indexer reliability somewhat higher (about 50 per cent)." For a series of tests of indexing
of a group of chemical patents by three experienced and three inexperienced indexers, they
found that the beginners had average matchings among the terms assigned by them to the
same documents of only 12.6 percent and that even for the experienced indexers the
average percent of matching terms was only 16.3 percent. 2/ In other studies, these
investigators have explored the effects of various indexing aids upon the reliability and
consistency of indexing, concluding that the use of prescriptive aids such as authority lists
improves reliability and inter-indexer consistency from 8 or 9 percent to 33 percent, while
those aids such as thesauri and association lists "which enlarge the indexer's semantic
freedom of term choice" are detrimental (Slamecka and Jacoby, 1963 [560]).

Rodgers in a study of intra-indexer consistency reports data for the re-indexing, by
the same person at a later date, of 60 documents dealing with the United Arab Republic
taken from The New York Times. She reports that the average consistency over all 60
documents was 59 percent. 3/ In a further study of inter-indexer consistency, 20 papers
from Area 5, ICSI, were key-word indexed by 16 people, all of whom were familiar with
the subject matter (although only 8 completed all 20 papers). Results are given in terms
of the proportions of the total number of unique words chosen by 100 percent of the subjects
(.008), half of them (.14), and only one of them (.52). 4/ Study of the results in terms of
the proportion of words selected in common by any pair of these indexers to the total
number of different words selected by them both gave a "grand mean agreement for all
two-person combinations for the 8 subjects . . . [of]. . 24 percent against all 20 articles." 5/
The mean percentage of overlap between Luhn's word-frequency selection technique (as
applied to the same papers) and any one or more indexers who agreed was .15.
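
The pairwise measure described here (words selected in common by two indexers, divided by the total number of different words selected by either) can be sketched as follows. This is an illustration under stated assumptions and not the investigators' own procedure: the function names and sample keyword sets are hypothetical, and scoring two empty selections as full agreement is an arbitrary choice.

```python
from itertools import combinations


def pair_agreement(a: set, b: set) -> float:
    """Words chosen by both indexers divided by all different words
    chosen by either (1.0 by convention if neither chose any word)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mean_pairwise_agreement(selections: list) -> float:
    """Mean of pair_agreement over every two-person combination."""
    pairs = list(combinations(selections, 2))
    if not pairs:
        raise ValueError("need at least two indexers")
    return sum(pair_agreement(a, b) for a, b in pairs) / len(pairs)


# Hypothetical keyword choices of three indexers for one paper.
indexers = [
    {"indexing", "automatic", "retrieval"},
    {"indexing", "retrieval", "evaluation"},
    {"indexing", "consistency"},
]
print(round(mean_pairwise_agreement(indexers), 3))
```

A grand mean of the kind reported would then be obtained by averaging this figure over all papers in the test.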


1/  Painter, 1963 [460], p. 94.

2/  Jacoby and Slamecka, 1962 [293], p. 16.

3/  Rodgers, 1961 [504], p. 12.

4/  Rodgers, 1961 [503], p. 50.

5/  Greer, 1963 [239], p. 10.



Still further studies of indexer consistency conducted at the Information Systems
Operation division of General Electric have just recently been reported (Korotkin and
Oliver, 1964 [331, 332]). In particular, the investigators report on the effects of subject
matter familiarity and of the use as a job aid of a reference list of suggested descriptors
upon inter-indexer consistency. The material for test consisted of 30 abstracts drawn
from Psychological Abstracts, to be indexed by 5 psychologists and 5 non-psychologists in
two sessions, with and without use of the "job aid". Results in terms of mean percent
consistency were reported as follows:

                              Session I     Session II

"Group A (Familiar)             39.0%         53.0%

 Group B (Non-familiar)         36.4%         54.0%"  1/

Corroborating evidence of a generally low rate of inter-indexer consistency is
provided by noting instances of duplicated indexing that may occur in regularly issued
announcement bulletins. During current awareness scanning of the DDC (ASTIA) "TAB"
in recent months, members of the staff of the Research Information Center and Advisory
Service on Information Processing have caught more than 20 cases of duplicate and even
triplicate indexing of the same item. (Two examples can be seen in Figure 8a and
8b.) For the 52 independent assignments involved in these items, the average
inter-indexer consistency is only 46.1 percent.

On  the  general  subject  of  indexing  consistency,  Black  comments  as  follows: 

"There  have  been  enough  experiments  to  indicate  that  there  is  no  consistency,  or 
very  little,  between  one  indexing  performance  by  a  given  individual  and  another 
indexing  performance,  at  a  later  date,  by  the  same  individual.    The  same  inconsis- 
tency has  been  discovered  among  different  individuals  all  indexing  the  same  docu- 
ments.   Thus  there  is  neither  inter-indexer  consistency  nor  intra-indexer  consis- 
tency in  any  system  that  depends  on  human  performance.  "  2/ 

There can be little doubt that the quality and consistency of most human indexing
practically available today is not good. Much of it, because of time and other pressures,
is either directly a word-extraction process, or it is inconsistent in assignment of many
relevant descriptors and subject category labels. On the other hand, today's indexing,
whether accomplished by man or machine, is probably no better and no worse than any
other classificatory or indexing procedures. The only excuse, therefore, for choice
between man and machine is the cost-benefit ratio, which is related on the one hand to
specific operational considerations and on the other to the question of whether or not
various indexers, and various users, would agree with the machine as much as they agree
with each other.

Before  turning  to  some  of  the  operational  considerations  affecting  the  cost-benefit 
ratio,  however,  certain  special  factors  should  be  briefly  mentioned. 

7.4      Special  Factors  and  Other  Suggested  Bases  for  Evaluation

The difficulties and problems of evaluation so far considered are generally applicable
to any indexing system, whether manual or automatic. Certain special factors arise,
however, when we consider some of the proposed automatic assignment and automatic
classification techniques. In addition, the prospects for computer processing hold at least the
promise of more objective measures of performance or quality than evaluative techniques
available today.

1/  Korotkin and Oliver, 1964 [331], p. 7.
2/  Black, 1963 [64], pp. 16-17.

[Figure 8a. Examples of Duplicate Indexing]

Examples of the special factors involved in assignment indexing techniques and
automatic classification include the question of the amount of computation required in the
inversion and other manipulations of large matrices 1/ and the concomitant problems of
how large a vocabulary of clue words can be used effectively and of whether some
documents cannot be indexed at all because they contain none of these words. 2/ There is, as
Needham says, "no merit in a classification program which can only be applied to a couple
of hundred objects." 3/

In  the  various  techniques  for  automatic  clustering  or  categorization  of  documents, 
there  are  serious  questions  of  whether  the  groupings  can  be  conveniently  named  or  dis- 
played for  the  benefit  of  the  user.  4/  Another  example  of  special  factors  in  the  appraisal 
of  an  automatically  generated  classification  scheme  is  as  follows: 

"Operational  testing  is  displeasing  in  that  it  puts  off  any  verification  until  right  at  the 
end;  it  is  expensive;  there  is  not  much  experience  on  how  to  do  it  in  a  realistic  way; 
and  it  is  ill-controlled  in  the  sense  that  the  practical  performance  of  a  system  is 
influenced  by  many  other  factors  than  the  classification  it  embodies.  "  5/ 

Examples of suggested bases for evaluation made possible by machine processing
itself include proposals by Doyle and Garvin, among others. Doyle in particular suggests
the substitution for the elusive concept of "relevance" of criteria based on "sharpness of
separation of exploratory regions in which the searcher finds documents of interest from
those in which he does not find such documents." 6/ He further emphasizes the need for
discriminating a particular document from other topically close documents (Doyle, 1961
[166]) and suggests that "this decision can never be made by a human only by a computer,
which is the only agency capable of having full consciousness of the contents of a
library." 7/ Garvin considers the more general problems of language and meaning, and
suggests that there are two kinds of "observable and operationally tractable manifestations
of linguistic meaning", namely, translation and paraphrase, and that these may be
investigated by techniques of linguistic data processing. 8/ Edmundson, however, points
out that while there is in general only one translation of a document, there may be as many
abstracts (and, by implication, index sets) as there are users. 9/ Thus we are back again
at the questions of purpose and relevance.

1/  Compare Williams, 1963 [642], p. 162.

2/  See Maron and Borko, various references.

3/  Needham, 1963 [433], p. 8.

4/  See, for example, Doyle, 1963 [162], p. 6: "Several researchers have tried to
    group topically close articles, usually by statistical means, but it is rather difficult
    to get any benefit from this grouping unless you can represent these groups for
    human inspection."

5/  Needham, 1963 [432], p. 2.

6/  Doyle, 1963 [164], p. 200.

7/  Doyle, 1961 [169], p. 23.

8/  Garvin, 1961 [224], p. 137.

9/  Edmundson, 1962 [178], p. 4.

8.    OPERATIONAL  CONSIDERATIONS


Whatever the verdict of evaluation of one or more automatic indexing techniques,
whether of the derivative, modified derivative, or assignment type, there are certain
operational considerations and problems that typically affect any attempt to apply such
techniques in actual production operations. These considerations, which also affect
linguistic data processing operations in general, include input considerations, availability of
methods or devices for converting text to machine-usable form, programming
considerations, questions of format and content of output, and problems of customer acceptance of
the machine products.

8.1      Questions  of  Input

Input considerations include, first, questions of the extent and availability of
material which can be handled directly by the machine. This may be limited to title only, to
title plus abstract, to title plus other material, 1/ or to preselected text or automatically
generated extracts; or it may in a few cases extend to full running text. Possible future
requirements may extend to the processing not only of full text but of interspersed graphic
material (equations, charts, diagrams, drawings, photographs) as well.

In other sections of this report we have considered typical arguments for and against
the limitation of input to titles only, to augmented titles, and to abstracts. The points to
be emphasized here are requirements for pre-editing or post-editing, provisions for error
detection and error correction, the time and cost requirements of conversion equipment if
material is not already available in machine-usable form, and the like. As Cornelius
suggests:

"Present  day  computers,  if  used  for  machine  indexing,  will  be  generally  input 
limited  and  will  require  excessive  data  preparation.    Causes  of  these  limitations 
are:    time  required  for  translation  to  machine  language,  verification  of  this  ma- 
chine language,  and  the  capability  or  lack  of  capability  of  correction  in  the  input 
media.  "  2/ 

Examples  of  pre-editing  requirements,   even  for  the  simple  case  of  keyword-in- 
title  indexing,  include  the  spelling  out  of  chemical  symbols,  the  encoding  or  the  omission 
of  subscripts  and  superscripts,  insertions  of  hyphens  to  prevent  indexing  of  a  word,  and 
substitutions of blanks for hyphens in compound words to assure indexing of each
component. 3/ For full text, a far more extensive and elaborate set of rules and conventions
must be developed and applied. 4/

1/  This may specifically include cited titles, as suggested variously by Bohnert, 1962
    [69], p. 19; Giuliano and Jones, 1962 [229], p. 10; Swanson, 1963 [580], p. 1;
    Gallagher and Toomey, 1963 [205], p. 53; and as used in the SADSACT method, see
    pp. 98-99 of this report.

2/  Cornelius, 1962 [140], p. 42.

3/  See, for example, Kennedy, 1961 [311], p. 120.

4/  See, for example, the sophisticated proposals of Nugent, 1959 [441], and Newman
    et al., 1960 [439].

Other editing may be required for format standardization, especially in the case of
citation indexes compiled by machine. 1/ O'Connor notes, however, that "the provision of
pre-editing information can slow down the keypuncher or typist, increase the chance of
mistakes, and require more intelligence or training on the typist's part." 2/
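
The effect of the hyphen and blank conventions mentioned above for keyword-in-title indexing can be illustrated with a small sketch. It assumes, purely for illustration, a program that treats every blank-delimited token not on a stop list as one index entry; the stop list, function names, and sample title are hypothetical and do not describe any particular KWIC program.

```python
STOP_WORDS = {"of", "the", "in", "and", "for", "a"}  # hypothetical stop list


def kwic_entries(title: str) -> list:
    """Return sorted (keyword, rotated title) pairs, one for each
    blank-delimited token that is not on the stop list."""
    tokens = title.split()
    entries = []
    for i, token in enumerate(tokens):
        if token.lower() in STOP_WORDS:
            continue
        # Rotate the title so that the keyword comes first; "/" marks
        # the point where the original title ended.
        rotated = " ".join(tokens[i:] + ["/"] + tokens[:i])
        entries.append((token.lower(), rotated))
    return sorted(entries)


def keywords(title: str) -> list:
    """Keywords under which the title would be filed."""
    return [keyword for keyword, _ in kwic_entries(title)]


# A hyphen ties "phase" to "Gas", so "phase" gets no entry of its own.
print(keywords("Gas-phase chromatography of hydrocarbons"))
# Substituting a blank for the hyphen indexes each component separately.
print(keywords("Gas phase chromatography of hydrocarbons"))
```

In this sketch the pre-editor's single keystroke decides whether the title is filed under three entries or four, which suggests why such conventions add to the typist's burden.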

Questions  of  error  detection  and  error  correction  apply  both  to  the  original  text  and 
to  transcribed  versions  if  these  are  necessary.    That  is,  the  basic  documents  themselves 
may  contain  typographical  errors,  misspellings,  and  the  like,  and  additional  errors  are 
bound  to  occur  at  all  subsequent  stages  requiring  human  processing.    Wyllys  discusses  the 
need  for  the  correction  of  spelling  errors,  mentions  suggested  computer  programs  for 
detection,  and  cites  a  private  communication  from  Stiles  suggesting  that  the  criteria  for 
accepting  words  as  valid  be  either  that  they  are  identified  as  already  being  in  the  system 
vocabulary  or  that  they  occur  at  least  twice  in  the  input  item.  3/ 
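
The acceptance criterion attributed to Stiles can be sketched as follows. This is an illustration only: the tokenizing rule (lower-cased alphabetic strings), the function name, and the sample vocabulary and text are assumptions, and only the two-part criterion itself comes from the text.

```python
import re
from collections import Counter


def accepted_words(text: str, vocabulary: set, min_count: int = 2) -> set:
    """Accept a word if it is already in the system vocabulary or if it
    occurs at least `min_count` times in the input item."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    return {w for w, n in counts.items() if w in vocabulary or n >= min_count}


vocabulary = {"indexing", "retrieval"}  # hypothetical system vocabulary
item = "Automatic indexing and retreival. Automatic methods for documents."
print(sorted(accepted_words(item, vocabulary)))
```

Here the misspelled "retreival" is rejected because it is neither in the vocabulary nor repeated, while "automatic" is accepted on the strength of its second occurrence. A misspelling that is itself repeated would pass the test, so the rule reduces transcription errors without eliminating them.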

Swanson's analysis of the reasons for retrieving irrelevant, and failing to retrieve
relevant, material in the case of text searching on the nuclear physics abstracts includes
typical data on the effect of errors. 4/ He found, for example, that failures to record
hyphenated words, subscripts, superscripts and other special symbols accounted for about
5 percent of failures to retrieve relevant items, and errors in transcription of either text
or search instructions accounted for another 3 percent of these failures. Errors in
keypunching of the search requests alone accounted for 4 percent of the cases of irrelevant
retrievals. By contrast, in the newspaper clippings experiments, where the input material
was already in machine-usable form, transcription errors were not a factor but the input
tape itself had many errors. In this special case, however, Swanson reports: "Garbles
are not important simply because messages are sufficiently redundant to insure that even
if one or two keywords for a given category are garbled, almost invariably others are
present." 5/

The news clippings material used by Swanson represents one class of materials that
are today initially available in machine-usable form, because the original recording of the
message or text resulted in a machine-usable medium, such as punched paper tape. A
punched paper tape is produced as the product of many typesetting operations, especially
for newspaper and magazine publication, and this will be increasingly true in the future,
together with computer-prepared tapes for input to automatic typographic composing
equipment. To date, however, equipment to convert from these tapes to the particular
machine language of a given computer processing system is largely unavailable, is
costly, and is highly subject to error. 6/


1/ See, for example, Atherton, 1962 [25], p. 4; Marthaler, 1963 [399], p. 22.
However, at least one computer program has been developed to assist in this process.
See Thompson, 1963 [600], p. II-1: "The present program takes bibliographic
citations and automatically arranges them into a standard format in such a
way that the various parts of the citation are unambiguously identified. These
standardized citations can later be processed by sorting and matching procedures to
identify similar citations and to effect various rearrangements."

2/ O'Connor, 1960 [444], p. 8.

3/ Wyllys, 1963 [653], p. 15.

4/ Swanson, 1961 [586], Appendix.

5/ Swanson, 1963 [580], p. 5.

6/ Compare, for example, Savage, 1958 [521], p. 11: "The use of tape as the
original input to the process has offered a number of problems which have yet to be
solved. One is the occurrence of typographical errors."


Moreover,   to  date,  very  little  material  in  the  scientific  and  technical  literature  is 
available  in  this  form.    As  of  1961,   it  was  reported  that  a  survey  by  McGraw-Hill  indicated 
that  only  about  2  or  3  percent  of  the  publications  in  the  United  States  were  then  prepared  by 
typesetting  tape,  that  most  of  this  was  in  the  form  of  Monotype  tape  which  because  of  its 
30-column width and special format is not generally compatible with tape reading equipment,
and that tapes had many errors in them which would require considerable effort to
correct. 1/ As of late 1963, Bennett reports:

"Computer  processing  of  natural  language  text  material  requires  that  a  body  of  data 
be available in machine-readable form. At present such a body of data results only
from a direct human copying process. An inquiry into existing transcriptions of
text which were machine-readable showed that they were abbreviated both in terms
of completeness and in number of symbols represented. As an alternative, text produced
as a by-product of typesetting operations is clearly an eventual possibility,
but present practices make the detection of unit delimiters such as ends-of-sentences
difficult." 2/

In the future, both machine-usable text from publishers and printers and the similarly
machine-usable paper tape produced as a byproduct from the original keystroking of
manuscript on such equipment as Flexowriters and Justowriters may alleviate this problem
for  new  items.    Nevertheless,   the  wealth  of  the  world's  present  literature,   the  informal 
and  unpublished  technical  reports  of  high  current  interest  but  limited  initial  distribution, 
and  material  acquired  from  foreign  sources,   will  continue  to  pose  for  the  foreseeable 
future  major  problems  either  of  automatic  reading  of  the  printed  page  or  of  human  re- 
transcription  at  high  cost. 

While  there  have  been  many  promising  developments  in  automatic  character  recog- 
nition techniques,  the  devices  that  are  now  available  for  production  use  are  limited  to 
small  character  sets,  such  as  a  single  alphabet  in  a  single  font,  often  of  special  design. 
The  multi-font  page  reader  is  not  only  not  yet  commercially  available  but  may  not  become 
so  for  some  years  to  come.    Even  if  it  were,  there  are  many  unresolved  and  as  yet  in- 
completely specified  problems  involved  in  the  development  of  suitable  rules  for  the  machine 
so  that  it  can  distinguish  between  title  or  page  number  and  text,  figure  caption  and  text, 
author's  name  in  a  cited  reference  and  the  title  of  the  paper  cited,  and  the  like.    A  case  in 
point,  not  only  for  automatic  reading  equipment  of  the  future  but  for  machine  processing 
of machine-usable material available today, is the difficulty of machine recognition of
punctuation  marks  as  used  for  different  purposes.  3/ 

In  the  absence,  then,  both  of  scientific  and  technical  documents  already  in  machine 
language  form  and  of  character  recognition  equipment  capable  of  reading  the  printed  page, 
we are left with the unsatisfactory situation of re-transcribing input material either by
use of a tape typewriter or by keypunching to punched cards. That this situation is unsatisfactory
and is a major bottleneck in machine processing of text in excess of the
bibliographic  citation  data  only  is  evidenced  by  such  typical  statements  as  these: 

1/ Cornelius, 1962 [140], p. 47.

2/ Bennett, 1963 [50], p. 141.

3/ See Bennett quotation above; Luhn, 1959 [384], p. 22, and Coyaud, 1963 [143].

"The expense of transcribing such documents in their entirety will be justifiable to
a  limited  extent  only  and  it  may,  therefore,  be  assumed  that  automatic  processing 
will be mainly applied to future literature." 1/

"As  long  as  we  are  limited  to  using  the  equipment  that  is  available  now,  the  pre- 
paration of  data  for  input  will  be  an  expensive  procedure  and  a  major  cost  factor  in 
automatic processing of natural language." 2/

"...    In  a  discussion  of  indexing  by  machine,  we  must  recognize  the  preparation  of 
input to the system as the major item of cost of operation." 3/

"Present  inability  to  read  documents  automatically  would  make  it  necessary  to  punch 
cards  or  tapes,  an  operation  likely  to  be  even  more  expensive  than  reading  by 
humans." 4/

In  addition  to  the  high  costs  of  manual  retranscription,  it  is  also  noted  that  keypunching 
"tends  to  undermine  the  purpose  of  natural  text  retrieval  by  requiring  human  effort  at  the 
input  end  of  the  process.  "  5/ 

In  particular,  keypunching  or  keystroking  requirements  undermine  the  purposes  of 
rapid  indexing  as  well  as  filing  for  retrieval  by  virtue  of  the  time  required  to  transcribe 
text.    Horty  and  Walsh  report,  for  example: 

"Flexowriter  operators  can  produce  between  1400  and  1800  lines  per  day  of  statutory 
text.    Keypunch  operators  used  in  previous  experiments  could  punch  approximately 
100  lines  per  hour  of  alphabetic  materials,  but  could  not  maintain  this  rate  for  a 
sustained  period  of  time.  "  6/ 

Thus,  until  such  time  as  more  versatile  character  recognition  equipment  is  available, 
even  some  of  the  most  ardent  advocates  of  full  text  processing  are  forced  to  the  use  of 
considerably  less  than  full  text  for  other  than  research  purposes.    Swanson  comments, 
for  example: 

".  .  .  One  must  note  that  the  manual  recording  of  text  may  be  exorbitantly  expensive. 
If  so,  a  judicious  selection  process  may  permit  a  reasonable  compromise  between 
the  expense  of  input  and  the  depth  of  indexing  which  results.    For  example,   it  is 
reasonable  to  select  the  title,  abstract,  table  of  contents  (if  any),  sub-headings,  and 
key  sentences  or  paragraphs.  "  7/ 

1/ Luhn, 1959 [384], p. 2.

2/ Ray, 1961 [496], p. 51.

3/ Howerton, 1961 [282], p. 327.

4/ Levery, 1963 [359], p. 235.

5/ Doyle, 1959 [168], p. 2.

6/ Horty and Walsh, 1963 [280], p. 259.

7/ Swanson, 1963 [580], p. 1.

"Costs come much more into line if we make available to the machine something on
the  order  of  one  per  cent  of  the  full  text.    Then,  of  course,  the  problem  of  selecting 
that  one  per  cent  presents  itself.  "  1/ 

8.2  Examples of Processing Considerations

A  second  major  area  of  operational  considerations  involves  the  machine  processing 
problems,  given  a  specified  input.    For  most  of  the  automatic  derivative,  and  modified  or 
normalized  derivative,  schemes,  this  is  primarily  a  question  of  the  limitations  of  machine 
language  to  a  vocabulary  of,  typically,  no  more  than  64  distinct  characters  for  input, 
internal manipulation, and output. In addition, the limited number of characters that can
be packed into a single machine-word complicates internal processing, storage, file look-up
(i.e., against exclusion or inclusion lists), and sorting operations.

Arbitrary  truncation  of  text  words  to,   say,   6  characters  per  word,  leads  to  certain 
computer processing or storage economies. However, it leads also to complications in
the selection of words either to be included (clue word lists) or excluded (stop lists) in
many of the proposed methods both for derivative and for assignment indexing. Additional
problems of artificial homography are created. Obvious examples are "Probab-le, -ility";
"Condit-ion, -ional"; "Freque-nt, -ntly, -ncy"; "Commun-ity, -ication, -al"; and the like.
Barnes  and  Resnick  include  in  their  studies  of  the  effectiveness  of  an  SDI  System  2/  the 
use  of  6  different  truncation  levels  (from  4  to  9  characters).    No  significant  differences 
were  found  in  terms  of  the  number  of  hits  (matches  of  a  new  item  to  a  user's  profile  which 
he  considered  to  be  of  definite  interest  to  him)  but  there  were  significant  differences  in  the 
number  of  notifications  sent  him,  as  presumably  matching  his  interest,  and  the  amount  of 
"trash"  (irrelevant  items)  among  these  notifications. 

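The artificial homography produced by truncation, noted above, can be shown with a minimal sketch in Python. The function name and the word list are illustrative assumptions only; the sketch groups words by their first few characters and reports any group holding more than one distinct word. It does not reproduce the Barnes and Resnick experiments.

```python
from collections import defaultdict

def truncation_collisions(words, length=6):
    """Group words by their first `length` characters and keep groups with 2+ distinct words."""
    groups = defaultdict(set)
    for word in words:
        groups[word.lower()[:length]].add(word.lower())
    return {stem: sorted(forms) for stem, forms in groups.items() if len(forms) > 1}

# Illustrative word list built from the examples in the text, plus one control word.
words = ["probable", "probability", "condition", "conditional",
         "frequent", "frequently", "frequency",
         "community", "communication", "index"]

collisions_at_6 = truncation_collisions(words, length=6)
collisions_at_9 = truncation_collisions(words, length=9)
```

With this word list, truncation to 6 characters merges all four families, whereas truncation to 9 characters leaves only "condition" and "conditional" indistinguishable. This is consistent with the general expectation that longer truncation levels produce fewer false matches.
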
The  importance  of  the  selection  criteria  in  derivative  indexing,  operationally  con- 
sidered, is  largely  a  matter  of  the  length  and  the  contents  of  the  stop  lists.    Variability  in 
practice  among  the  various  producers  of  KWIC  indexes  has  previously  been  noted,  3/  but 
there are some interrelated and interlocking factors which affect the quality, the costs,
and the customer acceptance of this type of machine-generated index. First, the number
of pages in a printed index is directly related to the total costs of producing that index. 4/
The amount of material covered on a single page can be increased by photographic or other
type of reduction (e.g., the 96 lines per page of the Bell Laboratories KWIC program output
are reduced by xerography to 62 percent of the machine output page size) (Kennedy,
1961 [311]), but the reduction must not be such as to exceed reasonable limits of legibility.

This,  in  turn,  means  that  the  number  of  entries  generated  for  each  title  (obviously, 
a  function  of  the  words  that  survive  stop  list  purging)  needs  to  be  held  to  a  reasonable 
minimum.  Thus: 

"One  of  the  major  limitations  of  the  published  index  stems  from  the  conflict  between 
the  quantity  of  text  that  must  be  placed  between  the  covers  and  the  capacity  of  the 
printed  page  to  handle  it.    The  size  of  the  page  and  the  legibility  of  the  printing 
determines  the  maximum  density  of  characters  which  can  be  read  without  special 
aids. "  5/ 

1/ Swanson, 1962 [584], pp. 470-471.

2/ Barnes and Resnick, 1963 [36]. See also p. 148 of this report.

3/ See discussion, pp. 65-66.

4/ See Markus, 1963 [394], p. 16.

5/ Taine, 1961 [592], p. 153.

The question of stop list effectiveness therefore becomes an operational factor as well as
one  that  may  affect  the  quality  and  acceptability  of  the  product.    On  the  other  hand,  too 
generous  a  purging  of  the  input  titles  may  of  course  reduce  the  utility  of  the  title  index  by 
the  elimination  of  too  many  potential  access  points  and,  in  particular,  many  that  users 
may  be  most  tempted  to  look  for. 

A  related  problem  has  to  do  with  the  number  of  pages  required  because  of  the  length 
of  the  title  line  allowed  in  the  listings.    A  suggestion  advanced  by  Brandenberg  (1963  [80]) 
is  the  assignment  of  numeric  codes  to  the  machine  stop  words  used  and  the  insertion  of 
these codes into the listed title line in the place of these presumably insignificant words.
Thus  one  of  the  KWIC  entries  for  the  title,   "Determining  Aspects  of  the  Russian  Verb 
from  Context  in  Machine  Translation"  might  go  from: 

    RMINING ASPECT OF THE     CONTEXT IN MACHINE TRANSLATION.     /DETE

to:

    ERMINING 032 416 712 RUS  CONTEXT 308 MACHINE TRANSLATION. /DET

This  particular  example  was  picked  at  random  from  a  KWIC  index  utilizing  a  103-106 
character title line, 1/ but it was deliberately shortened to the 60-character line length
found  in  many  such  indexes  in  order  to  illustrate  effects  of  chopping  and  wrap-around. 
Coincidentally,  it  also  illustrates  some  of  the  difficulties  of  designing  a  well-balanced 
exclusion  list  since  in  this  case  the  purged  word  "aspect"  is  apparently  being  used  in  a 
technical  sense  rather  than  in  the  common  one  of  "Various  aspects  of.  .  .  ".    By  accident, 
this  case  does  show  rather  severe  "aspects"  of  the  chopping  problem  in  the  loss  also,  for 
this  entry,  of  "Russian"  and  "verb"  although  they  would  of  course  be  picked  up  in  the  entry 
blocks  for  these  words.    Certainly,  however,  the  claimed  advantages  of  context  checking 
are  not  striking,  even  without  the  introduction  of  the  numeric  codes.    It  is  true  that  for 
excluded  words  longer  in  length  than  those  in  our  example  the  possible  conservation  of  the 
character-space to reduce the chopping effects for the same length line may result in improvements.
However, the replacement of, for example, "Preliminary investigations
of ..." by numeric codes would hardly assist the user in determining quickly from the
many possible entries under "..." which he should select for further personal perusal.
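
The chopping and wrap-around effects discussed above can be reproduced with a minimal sketch in Python. It is an illustration under stated assumptions, not the program of any of the KWIC producers cited: the function name, the stop list, the 60-character width, and the keyword column are all hypothetical, and the numeric-code substitution suggested by Brandenberg is not implemented. The title is treated as a ring of characters, with "/" marking the point at which the title begins, and each entry is a fixed-width window on that ring.

```python
import re

def kwic_lines(title, stop_words, width=60, key_col=25):
    """Return sorted (keyword, line) pairs; each line is `width` characters
    with the keyword starting at column `key_col` (0-based)."""
    body = title.upper() + "."
    pad = max(1, width - len(body) - 1)   # ring is never shorter than the line
    ring = body + " " * pad + "/"         # "/" marks where the title starts
    n = len(ring)
    entries = []
    for match in re.finditer(r"[A-Za-z]+", title):
        keyword = match.group().upper()
        if keyword.lower() in stop_words:
            continue                      # purged by the stop list
        first = match.start() - key_col
        line = "".join(ring[(first + k) % n] for k in range(width))
        entries.append((keyword, line))
    return sorted(entries)

# Hypothetical stop list; the title is the one used in the text.
stop_list = {"of", "the", "from", "in"}
title = "Determining Aspects of the Russian Verb from Context in Machine Translation"
entries = kwic_lines(title, stop_list)
context_line = dict(entries)["CONTEXT"]
```

Because this title with its end marker is longer than the 60-character line, every entry loses part of the title; with a longer line, or a shorter title, the whole title would appear once in each entry.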

Turning  to  the  case  of  automatic  assignment  indexing,  the  processing  considerations 
likely  to  be  involved  in  operational  factors  affecting  the  evaluation  of  a  system  are  much 
less easily exemplified. Obviously, conditions that hold for research experiments on
small (and usually, especially selected) samples do not necessarily relate to requirements
in potential productive applications. Exceptions are the problems of the sizes of term-term
and term-document co-occurrence correlation matrices that can be readily manipulated,
previously mentioned, 2/ and the concurrent problems of the size, and hence the
representativeness, of inclusion lists or clue-word vocabularies that can be accommodated.

Both  Maron  and  Borko  found,  even  in  their  limited  test  samples,  a  certain  proportion 
of new items that could not be indexed or categorized at all because these new items did
not contain any of the clue words recognizable by the system. 3/ Due perhaps to longer
selective clue word lists, as well as to the special nature of his items, Swanson found no
instances, for 775 test items, of failure to assign because of lack of indicative clues in the
input material. In the case of 60 tests against the SADSACT model, which uses approximately
1,600 words drawn from a "teaching sample" of items previously indexed to descriptors
(related by frequency of co-occurrence to any of 70-odd descriptors with whose

1/ Walkowicz, 1963 [629], pp. 136 and 137.

2/ See pp. 108 and 160 of this report.

3/ See Maron, 1961 [395]; also Borko and Bernick, 1963 [78].

assignment they had co-occurred), the machine had a sufficient basis in the input material
for  the  derivation  of  a  selection-score  for  at  least  12  descriptors  for  each  new  item.  The 
items  were  closely  similar  to,   though  not  identical  with,  the  source  items  from  which  the 
word  associations  with  descriptors  assigned  had  been  drawn.    The  sample  is  obviously 
critically  small.    Nevertheless,  the  possibility  that  extensive  clue  word  lists,  notwith- 
standing the  incorporation  of  trivial  and  even  erroneous  associations,   can  be  used  as 
effectively  as  smaller,  more  precise,  and  more  carefully  tailored  lists,  but  with  signifi- 
cant gains  in  memory  space  or  computational  requirements,  is  suggestive.    A  somewhat 
related  conclusion,  again  reflecting  the  effect  of  processing  requirements,  is  stated  by 
Needham  as  follows: 

"The  main  point  to  be  made  is  that  theoretical  elegance  must  be  sacrificed  to  com- 
putational possibility:  there  is  no  merit  in  a  classification  program  which  can  only 
be applied to a couple of hundred objects." 1/
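
The clue-word approach to assignment indexing discussed above can be sketched in Python as follows. This is a simplified illustration, not the SADSACT model or the methods of Maron, Borko, or Swanson: the function names, the teaching sample, and the scoring rule (a plain sum of word-descriptor co-occurrence counts) are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_associations(teaching_sample):
    """Count how often each word co-occurs with each assigned descriptor."""
    assoc = defaultdict(Counter)
    for words, descriptors in teaching_sample:
        for word in set(words):
            for descriptor in descriptors:
                assoc[word][descriptor] += 1
    return assoc

def score_descriptors(words, assoc):
    """Sum the co-occurrence counts of every recognized clue word in a new item."""
    scores = Counter()
    for word in words:
        if word in assoc:
            scores.update(assoc[word])
    return scores.most_common()

# Hypothetical teaching sample of previously indexed items.
teaching_sample = [
    ("radar antenna design".split(), ["RADAR", "ANTENNAS"]),
    ("radar signal detection".split(), ["RADAR", "DETECTION"]),
    ("antenna gain measurement".split(), ["ANTENNAS"]),
]
assoc = train_associations(teaching_sample)
ranked = score_descriptors("radar antenna gain".split(), assoc)
unindexable = score_descriptors("soil chemistry".split(), assoc)
```

An item containing none of the clue words receives an empty ranking, which corresponds to the unindexable items reported by Maron and Borko. A longer clue word list reduces such failures at the price of a larger association table.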

In  KWIC  type  derivative  indexing  by  machine,  except  in  terms  of  allowable  character 
sets  and  word-lengths  conveniently  processed,  the  problem  of  appropriate  programming 
languages  does  not  arise  to  any  serious  extent.    For  the  processing  of  material  in  research 
on  natural  language  text,  however,  the  choice  of  interpretative  and  compiler  types  of  auto- 
matic programming  languages  may  involve  computational  requirements  which,  while  being 
inappropriate  in  a  production  situation,  offer  considerable  flexibility  and  versatility  for 
experimental  purposes.    Examples  of  special  programs  of  this  type  include  the  use  of 
Yngve's  COMIT  by  Baxendale  and  Knowlton,  the  development  and  use  of  FEAT  by  Olney, 
Doyle,  and  others  at  SDC,  and  the  use  of  list-processing  techniques  in  the  General  Inquirer 
system.  2/  Yngve  describes  the  use  of  his  program  as  follows: 

"COMIT  has  also  been  used  in  the  experimental  work  in  information  retrieval  of 
Baxendale  and  Knowlton  at  IBM.    The  purpose  of  their  COMIT  program  was  to  accept 
as  input  the  title  of  a  document  and  to  produce  as  output,  not  only  descriptors,  but 
pairs of descriptors which are roughly of the form adjective-noun. The purpose of
the work is to automatically generate, from document titles, retrieval words of a
more specific nature than simply Boolean functions of the existence of certain words
in a title." 3/

The FEAT program was designed originally for word and significant-word-pair
frequency  counts.    Olney  describes  the  program  in  part,  as  follows: 

"FEAT  is  designed  to  perform  frequency  and  summary  counts  of  words  and  word 
pairs occurring in its natural text input; i.e., text written in ordinary English and
transcribed into Hollerith code according to some set of keypunching rules. To
focus attention on the semantic aspects of word pairs rather than on their syntactic
aspect, pairs of which one member is a function word, such as 'the', 'is', 'by',
etc., are excluded."

"Using  a  bucket  list  structure  of  the  type  proposed  by  C.  J.  Sheen  in  FN- 1634,  the 
program  sorts  each  incoming  word  serially,  constructing  a  list  within  each  of  256 
buckets  for  good  words  of  a  given  alphabetic  range  .  .  .  and  another  list  within  each 
good  word  entry  for  the  Doubles  and  Reverses  which  will  be  ordered  alphabetically 

1/ Needham, 1963 [433], p. 8.

2/       Stone,  et  al,  various  references,   p.    137  of  this  report. 
3/       Yngve,   1962  [655],  p.  26. 

on that word ... If there are four different Double types of which the first word is
'external'  the  addresses  of  the  four  different  second  words  form  a  new  list  which  is 
linked  to  the  entry  for  'external'.    Each  word  type  occurs  only  once  in  core,  and  all 
word  pairs  of  which  it  is  a  member  refer  to  it  by  means  of  its  core  addresses.  " 

"The  program  could  process  millions  of  words,  automatically  generating  frequency 
counts far larger than the Thorndike and Lorge counts, which cost many man-years,
and in addition, FEAT would provide complete lists of word pairs (Doubles and
Reverses), which, so far as we know, have never been counted in a sample of appreciable
size, despite their importance for semantic analysis of text."
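
The kind of counting described in these quotations can be approximated by a minimal sketch in Python. It is not the FEAT program: the function name and the function-word list are hypothetical, a word pair is assumed here to mean two adjacent words, and the exact definitions of "Doubles" and "Reverses" are not given in the passages quoted. Python dictionaries stand in for the bucket lists, so that each word type is stored once as a key.

```python
from collections import Counter

# Hypothetical function-word list; the quotation names only 'the', 'is', 'by', etc.
FUNCTION_WORDS = {"the", "is", "by", "of", "a", "an", "and", "in", "to"}

def count_words_and_pairs(text):
    """Count content words and adjacent pairs in which neither member is a function word."""
    tokens = text.lower().split()
    word_counts = Counter(t for t in tokens if t not in FUNCTION_WORDS)
    pair_counts = Counter(
        (first, second)
        for first, second in zip(tokens, tokens[1:])
        if first not in FUNCTION_WORDS and second not in FUNCTION_WORDS
    )
    return word_counts, pair_counts

sample = "external memory is the external memory of the machine"
word_counts, pair_counts = count_words_and_pairs(sample)
```

In this sample the only surviving pair is ("external", "memory"), counted twice; every other adjacent pair contains a function word and is excluded.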

FEAT is used, together with a modified version of the Proto-Synthex program and
special output formatting routines, for another SDC program, the Descriptor Word Index
Program, which produces a content-word concordance for natural language text as well
as statistics reflecting the type of words that occur, frequencies of occurrence, and positional
data (Olney, 1960 [457], 1961 [456]; Stone, 1962 [574]).

The IPL-V list-processing language is used by Kochen in some of his work on simulated
concept processing by machine. Programs for accepting sentences written in a
formal language which was constructed of names and logical predicates (inserted either
from a console or in the form of punched cards), for updating and re-organizing a file of
such sentences, for storing and manipulating metalinguistic sentences such as "If X is
author of Y and Y pertains to topic Z, then X has worked on topic Z", for interrogating
the file, and for tracing associations between names linked through various predicates,
have been written in this language. 1/
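
The metalinguistic sentence quoted above can be illustrated by a minimal sketch in Python rather than IPL-V. The representation of sentences as (predicate, name, name) triples, the predicate names, and the sample file are assumptions made for the example and do not reproduce Kochen's programs.

```python
def infer_worked_on(facts):
    """Apply: if X is author of Y and Y pertains to topic Z, then X has worked on Z."""
    authors = [(x, y) for (pred, x, y) in facts if pred == "author_of"]
    topics = [(y, z) for (pred, y, z) in facts if pred == "pertains_to"]
    return {
        ("worked_on", x, z)
        for (x, y) in authors
        for (y2, z) in topics
        if y == y2
    }

# Hypothetical file of sentences; the names are placeholders.
facts = {
    ("author_of", "Smith", "Report 12"),
    ("pertains_to", "Report 12", "indexing"),
    ("author_of", "Jones", "Report 40"),
}
derived = infer_worked_on(facts)
```

Only the first author yields a derived sentence, since the file records no topic for the second report.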

8.3  Output Considerations

Turning  to  operational  problems  of  output,  the  question  of  limitations  of  computer 
printout  language  to,  in  most  cases,  a  single  set  of  upper  case  alphabetic  characters, 
numerals,  and  a  few  special  symbols,  2/  is  a  serious  factor  in  customer  acceptance  with 
respect to appearance -- format, legibility, readability. Involved here are questions previously
mentioned. Where, in the only presently available outputs of machine-generated
indexes, the KWIC type permuted title indexes, should the indexing access point "slot" be
on the page? Should all or only part of the title be displayed? Should 60- or 106-character
lines be used? More detailed discussion of these and related points is provided by, for
example, Youden (1963 [658]), Kennedy (1962 [311]), and Brandenberg (1963 [80]).

A  separate,  but  related  question,  is  how  much  identification,  and  in  what  form, 
should  be  provided  for  the  item  itself  either  directly  as  a  part  of  the  index  entry  or  by 
cross-reference  to  the  address  of  more  detailed  information.    There  seems  to  be  quite 
general  agreement  that  the  typical  user  needs  something  more  than  author's  name  and  title 


1/ Kochen, et al, 1962 [328], p. 34.

2/ See, for example, Lipetz, 1960 [365], p. 252: "A disadvantage of keypunched cards,
however, is the lack of capacity to record or to print other symbols than a one-case
alphabet, one case of arabic numerals, and about a dozen punctuation marks and
miscellaneous symbols. Citations in the scientific literature generally make use of
a much larger number of significant symbols: multiple cases, multiple fonts, italics,
boldface, Greek letters, mathematical symbols, etc." Note, however, that
Chemical-Biological Activities, a digest produced by Chemical Abstracts Service, uses
printouts of the modified IBM 1403 chain printer, using 120 characters (see Fig. 5).

alone to guide him. 1/ However, if the full bibliographic citation, perhaps the abstract as
well,  is  to  be  printed  out  by  machine,  the  problems  of  limited  character  set  are  even  more 
severe. This problem is today being solved, in some cases, by separate operations involving
sorting and assembly of the full citations and abstracts of the items indexed, separately
prepared, for photographic reproduction or typesetting. Hopefully, this partial
solution will become obsolete as automatic type-composition equipment and computer-prepared
typesetting techniques become more generally available.

Operational  considerations  thus  involve  the  costs,  the  availability,  and  the  limitations 
of equipment now usable for machine-generated index production. Schultz and Schwartz
report, as of October, 1962:

"There  are  two  major  bottlenecks  in  automated  index  production  caused  by  inadequate 
equipment  development  at  the  present  state-of-the-art: 

"1.      There  is  no  way  of  using  automatic  input  of  the  printed  page  or  the 
indexer's  notes; 

"2.      There  is  insufficient  flexibility  in  the  forms  of  output  available  for  a 
computer-produced index.

Both  of  these  areas  are  being  worked  on  by  equipment  manufacturers,  and  an  early 
solution  has  been  promised.  "  2/ 

In  general,  operational  considerations  of  this  type  do  not  affect  the  appraisal  of  auto- 
matic assignment  indexing  techniques,  because  these  have  not  yet  been  developed  to  the 
point  of  practical  application  on  any  realistic  scale.    Moreover,  the  difficulties  of  problem 
definition  and  basic  understanding  of  language  and  meaning  yet  remaining  to  be  resolved 
are  such  that  radical  new  advances  in  computer  technology,  associative  memories,  char- 
acter readers  and  pattern  recognition  devices  may  completely  alter  the  picture  before 
practical  systems  are  ready  for  operational  tests.    Thus,  for  example,  it  is  claimed: 

"It  appears  desirable  to  begin  experimentation  with  automatic  indexing  so  that  solu- 
tions will  become  known  by  the  time  character  recognition  equipment  will  have  pas- 
sed the  laboratory  stage. "  3/ 

Similarly,  Doyle  suggests  that  the  "present  rate  of  solution  of  the  intellectual  problems  of 
IR  is  sufficiently  slow  that  these  advanced  devices  will  be  in  common  use  long  before  IR 
will  truly  benefit  from  their  presence",  and  he  urges  that  researchers  proceed  as  though 
such  machines  were  already  with  us.  4/ 


1/ Compare, for example, Montgomery and Swanson, 1962 [421], p. 366: "This study
suggests that indexing should be based on more than titles and that a bibliographic
citation system should present to the requestor something more than titles"; see
also, in addition to references cited, p. 61, footnote 1, IBM "ACSI-matic auto-abstracting
project ...", Vol 3, 1961 [290], p. 89: "The use of titles in document
searching without any additional abstract seems to lead to a high number of ...
errors, i.e., accepting documents which should be rejected, as not enough information
is available to judge the pertinence of documents."

2/ Schultz and Schwartz, 1962 [531], p. 432.

3/ Levery, 1963 [359], p. 235.

4/ Doyle, 1961 [169], p. 3.

9.  CONCLUSION:  APPRAISAL OF THE STATE OF THE ART IN AUTOMATIC INDEXING


Notwithstanding  the  difficulties  of  evaluation  we  have  discussed,  we  shall  herewith 
attempt  to  evaluate  the  present  state  of  the  art  in  automatic  indexing  techniques,  using  such 
available criteria as seem most appropriate. First, we suggest that all of our initial
questions, except possibly the last, can today be answered affirmatively. "Is indexing by
machine  possible  at  all?"    To  this  we  can  answer  an  unequivocal  "yes"  in  view  of  the  many 
examples  of  KWIC  type  indexes  extant  and  in  practical  use.    Secondly,   "Is  what  can  be  done 
by  machine  properly  termed  'abstracting',    'indexing',   or  'classifying'?"   If,  by  definition, 
word  indexing  of  any  kind  is  not  "properly  termed.  .  .  indexing",  then,  as  we  have  seen, 
automatic  derivative  indexing,   such  as  KWIC,  or  the  selection  of  words  to  serve  as  index 
tags  based  upon  the  frequencies  of  their  occurrence  in  text,   is  not  so  either. 

The  fundamental  Luhn  concept  for  indexing  based  on  word  frequencies  is,  as  we  have 
seen, straightforward: namely that, after disregarding the most frequent "common words",
especially those that are syntactic-function words -- articles, conjunctions, prepositions,
and the like, together with those words that occur infrequently in a given text, the remaining
high frequency words should give a reasonable indication of what the author was writing
"about". Critiques of the Luhn position have been made on several grounds:

(1) Information-theoretic - that, in fact, the most information is conveyed by
    the least frequent words.

(2) Absolute vs. relative frequencies of usage within specialized fields.

(3) Modifications of semantic purport by contextual and syntactic associations.

(4) Problems of synonymity and, conversely, of orthographically identical
    words. 1/

(5) Multi-aspect points of interest, and future need of access to material the
    author himself did not emphasize.
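
The Luhn concept summarized above, to which these critiques are addressed, can be stated as a minimal sketch in Python. The function name, the common-word list, the thresholds, and the sample text are illustrative assumptions, not Luhn's own parameters: common words are discarded, words occurring fewer than a minimum number of times are discarded, and the most frequent of the remaining words are offered as index terms.

```python
from collections import Counter

# Hypothetical common-word (stop) list.
COMMON_WORDS = {"of", "by", "the", "in", "and", "are", "a", "an", "is", "to"}

def frequency_index_terms(text, common_words, min_count=2, top_n=5):
    """Return up to `top_n` non-common words occurring at least `min_count` times."""
    tokens = [t.strip(".,;:()\"'").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in common_words)
    return [word for word, count in counts.most_common(top_n) if count >= min_count]

sample = ("Automatic indexing of documents by machine. The machine counts words "
          "in the documents, and indexing terms are the frequent words.")
terms = frequency_index_terms(sample, COMMON_WORDS)
```

For this sample the selected terms are "indexing", "documents", "machine", and "words". The sketch also makes critique (1) concrete: a word such as "automatic", which occurs only once, is discarded even though it may be the most informative word in the passage.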

The  last  point  raises  again  the  criticisms  that  have  been  made  against  derivative, 
j  extractive  or  "word"  indexing  of  all  types.    To  repeat,  although  such  procedures  may 
index  "as  the  author  himself  indexed  best  --  in  his  own  language",  the  significant  points 
are  (1)  there  may  be  peripheral,  minor,  or  unrecognized  aspects  of  his  topic  and  incident- 
al information  disclosed,  of  future  interest  to  others,  which  the  author  himself  is  in  no 
special  position  to  recognize,  and  (2)  notwithstanding  the  "author's  own  terminology"  being 
current usage rather than the "fossilized" vocabulary of any previously established classification
or indexing scheme, this very "currency" changes from field to field and, quite
literally,  from  day  to  day.    Nevertheless,  it  should  be  re-emphasized  that  the  validity  of 
these  criticisms  is  not  limited  to  automatic  derivative  indexing  as  such,  but  rather  is 
applicable against any indexing system whatsoever, manual or machine, which is so
strictly limited to author-terminology, author-emphases, and the consideration of the
document at hand as a self-contained entity, without regard to other documents in a collection,
in a particular field, and without respect to specific user needs.  By contrast to
this  type  of  limitation,  more  promising  approaches  should  stress  both  similarities  and 
differences  between  a  new  document  and  previously  received  documents,  between  docu- 
ments "belonging"  to  some  definable  category,  or  not,  and  even,  as  responsive  to  a  partic- 
ular user's  profile-of-interest,  or  not. 


1/  See Baxendale, 1962 [42], pp. 67-68: "... resolution of orthographic ambiguities
is  a  non-trivial  and  over-riding  prerequisite  for  the  computer  processing  of 
text ...", p. 67.

Derivative indexing, whether by man or machine, is thus subject to many disadvantages.
First and foremost, it is constrained by a particular individual's personal manner
of  expression  of  concepts  in  language.    This  limitation  is  controlled  only  by  his  presump- 
tive desire  to  communicate  with  some  particular  (more  or  less  general,  or  more  or  less 
specialized)  audience.    His  choices  of  natural  language  expressions,  however,  will  be 
conditioned  by  at  least  some  of  the  following  factors: 

(1)  The  range  and  precision  of  his  personal  mastery  of  both  general  and 
specialized   vocabularies  for  a  given  time,   place,  and  specialized  field 
of  discourse. 

(2)  His  personal  expectations  as  to  the  probable  reactions  (in  the  sense  of 
effective  communication)  of  his  intended  audience  to  the  expressions  that 
he  does  choose,   involving  all  of  the  problems  of  different  usages  of  tech- 
nical terminology  from  field  to  field,  from  formal  to  informal  presenta- 
tions, from  scholarly  reviews  to  progress  reports  heavy  in  current 
"technese"  and  "fashionable  words". 

(3)  His  habits  of  thought  and  his  training  in  his  field. 

(4)  His  awareness  of  more  than  one  possible  audience  and  of  more  than  one 
point  or  topic  of  potential  interest  to  his  readers. 

Secondly,   indexing  by  the  author's  own  words  is  remarkably  sensitive  to  a  particular 
period of time, so that the terminology becomes rapidly outdated and often seriously misleading
in its connotations.  Thirdly, the user has no advance knowledge of the terminology
that  has  been  used  in  all  the  varied  texts  of  a  collection  and  he  must  therefore  be  able  to 
predict  a  wide  variety  of  possible  ways  of  expressing  ideas  in  words,   phrases,  and  even 
by  implication.    Fourthly,  for  collections  indexed  on  a  word-derivative  basis,  there  is 
little  or  no  possibility  for  generic  searching.    1/   Finally,  there  is  the  more  general 
question,  applicable  to  both  derivative  and  assignment  indexing,  of  how  well,  ever,  can  a 
condensed  representation  serve  the  purposes  of  specific  subject  content  recapture?   In  the 
strict  sense,  only  by  the  elimination  of  truly  redundant  information.    But  even  this  is  a 
relative  matter.    What  is  redundant  for  an  author  may  not  be  so  for  several  different  po- 
tential users  of  the  reports  or  papers  that  this  author  writes.    What  is  redundant  for  one 
user  is  not  necessarily  so  for  others. 

The  further  problem  for  machine  techniques  is  therefore:   how  selection  rules  can 
be  provided  that  will  replicate  a  given  human  pattern  of  selectivity,  or,  alternatively,  how 
selection  rules  can  be  established  and  defined  that  will  produce  an  equivalent  and  compar- 
able result  -  that  is,  one  which  typical  users  would  agree  is  as  pertinent  to  their  query- 
answer  relevance  decisions  as  any  available  alternative. 

Certainly  the  problem  of  appropriate  selection  is  at  the  heart  of  the  matter.    This  is 
a  crucial  question,  even  if  we  sort  out  and  can  specify  the  different  uses,  for  a  particular 
collection,  a  particular  clientele,  at  a  particular  time,  that  automatically  generated  con- 
densed document  representations  may  have.    Wyllys,  in  appraising  automatic  abstracting 
efforts,  considers  that  the  goal  should  be  to  provide  extracts  which  will  serve  a  search- 
tool  function  --  that  is,  they  will  furnish  the  searcher  with  enough  information  about  the 
document  content  so  that  he  may  decide  whether  it  is  probably  pertinent  to  his  then  interests 
or  not  and  hence  decide  whether  or  not  to  read  the  document  in  full.    By  contrast,  he  says 
of  the  "content-revelatory  function"  that  an  abstract  should:    "furnish  the  reader  with 
enough  information  about  the  related  document  so  that  in  most  cases  he  will  not  need  to 
read  it  itself.  "  2/ 


1/  See for example, Doyle, 1963 [162], with respect to lack of capacity for generic
searching  as  one  of  the  major  disadvantages  of  natural  text  search  systems. 

2/  Wyllys, 1963 [653], p. 6.

Let us recall the objections to the use of the terms "auto-encoding" (or "auto-indexing"
or "auto-abstracting") because of the possible connotation of self-encoding, etc.  1/
This is an objection based upon avoiding ambiguous or misleading terminology, but it also
points to an objection as to the principle involved -- that is, of treating the document itself,
in its own right, as a self-sufficient, self-contained universe of discourse, and of assuming
that some type of summation-condensation over a number of different and individually-derived
representations of the separate documents in a collection can provide an effective
selection-retrieval guidance system to the contents of various specific documents in that
collection.  Even when the actual operations are to be abetted by synonym reduction and
normalization procedures (whether at the indexing or search negotiation stage, or both),
there is a significant difference between this endogenous hypothesis and its exogenous
alternative: that the basis for automatic indexing be the consensus of the collection, or of
a sample of the collection, or of prior indexing.

Assignment  indexing,  especially  in  the  sense  that  concept-indexing  is  the  goal,  may 
be  subjectively  preferable  to  derivative  indexing  not  only  because  it  involves  exogenous 
emphases  but  because  it  tends  to  delimit,  centralize,  and  standardize  the  access  points 
available  to  the  user  in  his  search-retrieval  operations.    However,  in  terms  of  the  human 
indexing  situation,  it  involves  all  the  traditional  difficulties  of  indexing  -  which  in  turn 
invoke  the  problems  of  evaluating  indexing  systems: 

"Justification  for  any  indexing  technique  must  ultimately  be  based  on  successful 
retrieval.    Success  can  only  be  evaluated  in  terms  of  a  closed  system;  that  is,  a 
system  wherein  sufficient  knowledge  is  available  of  the  entire  contents  of  the 
materials,   so  that  an  evaluation  can  be  made  of  various  techniques  as  to  their 
retrieval  effectiveness.    The  various  systems  .  .  .  cannot  really  be  weighed  except 
on  the  basis  of  a  test  comparing  one  against  the  other.    This  has  not  been  done  in 
any place."  2/

Nevertheless,  there  are  a  variety  of  reasons  for  accepting  even  the  relatively  crude 
derivative indexing products as practical tools today, for seeking machine-usable rules
for  the  improvement  of  these  products,  and  for  continuing  research  efforts  in  automatic 
assignment  indexing  and  automatic  classification.    There  are,  first  and  foremost,  the 
cases  where  conventional  indexes  are  inadequate  or  non-existent.    Thus  Wyllys  claims: 

"It  is  well-known  that  the  current  methods  of  producing,  through  human  efforts, 
condensed  representations  of  documents  are  already  hopelessly  inadequate  to  cope 
with  the  present  volume  of  scientific  and  technical  literature.    Many  papers  are 
never  indexed  or  abstracted  at  all,  and  even  in  the  cases  of  those  that  are  indexed 
or  abstracted,  the  indexes  and  abstracts  do  not  become  available  until  six  months 
to two years after the publication of the paper."  3/

Again,   with  respect  to  automatic  derivative  indexing,   especially  KWIC  indexes  based 
on  titles  alone,  there  can  be  no  question  as  to  the  evaluation  criterion  of  timeliness.  The 
success  of  this  aspect  is  widely  acknowledged  by  users,   systems  planners,  and  interested 
observers.    On  the  other  hand,  there  is  very  little  reported  evidence  available  on  which 

1/  See p. 3 of this report.
2/  Black, 1963 [64], p. 16.
3/  Wyllys, 1961 [650], p. 6.

any objective measure of comparative cost-benefit ratios may be obtained.  Black reports,
but without supporting data, that:

"It has been estimated that the efficiency of KWIC indexing is about 76 per cent compared
with about 82 per cent for conventional indexing or classification."  1/

White  and  Walsh  report  that: 

"From  the  limited  experiment  on  methods  of  indexing  the  1962  issues  of  the  Abstracts 
of  Computer  Literature,   the  permuted  title  indexing  retrieved  only  52  percent  of  the 
information.    This  low  percentage  may  be  attributed  to  the  changing  and  not  yet 
uniformly  standardized  terminology  existing  in  computer  technology.  "  2/ 

KWIC  indexes,  because  of  their  very  currency,  are  fulfilling  significant  maintaining- 
awareness  needs  today.    Improved  titling  practice,  enforced  by  editorial  rigor  or  contract- 
ual requirements  or  both,  can  improve  their  usefulness.    They  fill  gaps  in  the  bench 
scientist's  or  engineer's  ability  to  know  about  what  might  be  of  interest  to  him,  either 
because the material is not otherwise covered in normal secondary publication (e.g., conferences
and proceedings of symposia, internal technical reports not produced on Government
contracts and therefore not announced and indexed by the cognizant agencies, and the
like) or because the sheer bulk of the product of indexing-abstracting services in his field
prevents  his  effective  use  of  these  services  unless  more  specific  access  points  are  pro- 
vided.   The  claim  that  "something  is  better  than  nothing"  is  not  without  merit,   3/  even 
with  all  the  problems  of  non-resolution  of  synonymity,  homography,  topical  scatter,  long 
blocks of entries under the sorting term, the even more significant disadvantages of author-bias
towards his principal topic, the author's choice both of emphasis and terminology,
and the like.  Williams, considering word-with-context indexes, whether limited to title
only  or  to  titles  with  readily  available  augmentation,  makes  the  following  comments: 

"Limitations  and  other  troublesome  features  of  the  method  have  been  obvious,  but 
perhaps  over  obvious,  in  the  light  of  its  growing  acceptance  and  of  the  basic  validity 
of  permitting  a  document  to  speak  for  itself,  even  in  a  much  abstracted  recapitulation. 
Wherever  there  are  large  and  growing  problems  in  maintaining  publication  schedules 
for  established  subject  indexes,  or  wherever  pressing  needs  develop  for  more  fre- 
quent indexes,  for  rapid,  low-cost  cumulation,  or  for  indexes  in  areas  where  suit- 
able indexing  services  are  wanting,  there  no  apology  is  needed  for  proposing  that 
this  method  be  considered  and  tried,  as  a  precursor  to  'better'  indexing,  if  not  as  a 
substitute.    Its  use  may  be  of  interest    also  in  less  troubled  circumstances,  in  its 
own right, and because of common elements involved in its production and the provision
of other wanted products and functions (catalog records, current-awareness,
lists, etc)."  4/
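
The permuted-title principle behind the KWIC indexes discussed above can be illustrated with a short sketch. It is a simplified layout under stated assumptions, not the centered-column format of production KWIC listings: the stop list `STOP_WORDS`, the function name `kwic_entries`, and the use of "/" to mark the end of the original title are all choices made here for illustration.

```python
# Hypothetical stop list; words on it do not become index entries.
STOP_WORDS = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}

def kwic_entries(title, stop_words=STOP_WORDS):
    """Return sorted (keyword, rotated_title) pairs for one title.

    Each significant word becomes a sorting term, and the title is rotated
    so that the term comes first; "/" marks where the original title ended.
    """
    words = title.split()
    entries = []
    for i, word in enumerate(words):
        key = word.strip(".,;:()").lower()
        if not key or key in stop_words:
            continue
        rotated = " ".join(words[i:] + ["/"] + words[:i])
        entries.append((key, rotated))
    return sorted(entries)
```

Because every entry is derived from the author's title words alone, the sketch reproduces the weaknesses noted in the text: synonyms scatter under different sorting terms, and a poorly chosen title yields poor access points.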

Returning  to  the  question  of  whether  automatic  indexing  is  possible,   it  can  be  seen 
that,  at  least  in  the  derivative  indexing  sense,  it  is  not  only  possible  but  can  be  practically 
useful.    To  dismiss  the  evidence  of  automatic  derivative  indexing  operations  that  are  in 
production  today  by  rigorous  definition  of  what  indexing  is  in  effect  anticipates  both  our 

1/  Black, 1962 [65], p. 318.

2/  White and Walsh, 1963 [639], p. 346.

3/  See Veilleux, 1962 [624], p. 81: "Accepting the premise that partial control of information
satisfies more consumers than absence of control, perfection was traded
for  currency. " 

4/  T. M. Williams, private communication, dated January 4, 1962.

third and fourth questions: whether machine-generated indexes are as good or better than
the products of human operations and of how we can measure and appraise the adequacy of
any indexing system whatever.  Here are encountered the "core" problems of meaning in
communication, of information loss in any reductive transformation of actual messages or
documents, of relevance of particular messages to particular queries and to particular
human needs, of judgments of relevance.

Because  of  these  underlying  yet  overriding  questions,  the  state-of-the-art  in  the 
evaluation  of  indexing  systems  is  in  fact  far  more  primitive  than  that  of  automatic  indexing 
itself.    An  easy,  and  an  early,  solution  is  not  likely.    Therefore,  today,  in  appraising 
machine  potentials  for  assignment  indexing  we  are  faced  with  what  is  in  effect  a  single 
criterion:   namely,  will  a  given  group  of  human  evaluators,  whatever  their  standards  and 
requirements,  agree  as  much  with  the  products  of  an  automatic  indexing  procedure,  other- 
wise competitive  on  a  cost-benefit  ratio  with  human  indexing  of  the  same  material,  as  they 
do  amongst  themselves? 

Within the limits of small, specially selected samples of document or message collections,
it is possible to demonstrate that:

(1)  Replication  of  the  products  of  at  least  some  existing  systems,  within  the 
consistency  levels  observed  for  these  systems,  can  be  achieved. 

(2)  Retrieval  effectiveness  with  respect  to  relevant  items  indexed  by  auto- 
matic assignment  procedures  can  be  at  least  as  good  as,  and  may  be 
superior  to,  that  obtained  from  run-of-the-mill  manual  indexing  of  the 
same  items. 

(3)  Costs  of  indexing  can  be  held  at  or  below  the  costs  of  equivalent  manual 
indexing,  provided  both  that  the  input  material  required  is  already  in 
machine-usable form, or can be held to an average of, say, 100 words or
less, and that the clue-word lists, association factors, or probabilistic
calculations  can  be  accommodated  within  internal  memory. 

(4)  Significant  gains  in  time  required  to  generate  an  index  or  to  index  or  re- 
index  a  collection  can  be  achieved. 

Some  degree  of  theoretical  success  in  assignment  indexing  by  machine  can  thus  certainly 
be  claimed.    Moreover,  many  of  the  test  results  reported  do  clearly  indicate  a  quality  of 
indexing,  for  a  given  collection  at  a  given  level  of  specificity  of  indexing,  at  least  com- 
parable to  that  which  is  typically  and  routinely  achieved  by  people  in  a  practical  indexing 
situation.    No  more  should  be  asked  of  the  automatic  techniques  unless  better  human  index- 
ing can  be  specified  as  being  equally  feasible,  timely,  and  practical.    Further,  no  more 
should  be  asked  of  automatic  techniques  in  terms  of  the  evaluation  of  their  potentialities, 
than  is  now  asked  of  the  manually-prepared  alternatives.  1/ 

Data  with  respect  to  comparison  of  the  results  of  automatic  assignment  indexing 
techniques  to  either  a  priori  or  a  posteriori  human  judgment  have  been  mentioned  previous- 
ly in  this  report  in  terms  of  actual  test  results  reported,  and  the  most  significant  of  these 
reported data are summarized in Table 2.  2/  Typically, however, these data reflect, in
varying degrees, so small a sample of test cases, of user preferences, and/or of special
purpose and interest, that no general extrapolation is reasonable.  Moreover, the general
questions  of  the  "core"  problems  of  evaluation  in  general  again  rear  their  own  ugly  heads. 


1/  Compare,  for  example,  Kennedy,  1962  [311]  and  Needham,  1963  [433]. 
2/  See pp. 101-103 of this report.

Thus, Borko and Bernick point out:

"Up  to  this  point  we  have  used  human  classification  as  our  criterion  for  the  accuracy 
of  automatic  document  classification.    Against  this  criterion  we  have  been  able  to 
predict with approximately 55% accuracy, and no more.  Is this because our techniques
of automatic classification are not very good, or is it because our criterion
of  human  classification  is  not  very  reliable?    There  is  some  evidence  to  indicate  that 
the  reliability  of  human  indexers  is  not  very  high.    The  reliability  of  classifying 
technical  reports  needs  investigating  and,  perhaps  even  more  basically,  the  reasons 
for using human classification as a criterion at all."  1/

In  general,  the  results  of  automatic  index-term  assignment  procedures  appear  to  run 
in  the  area  of  45-75  percent  agreement  with  prior  human  indexing,  2/ and  this  in  turn  is  well 
within  range  of,  and  often  superior  to,  estimates  of  human  inter-indexer  consistency  based 
on  actual  observations  and  tests.    There  can  be  little  or  no  doubt  that  the  results  of  auto- 
matic assignment  indexing  experiments  to  date,   (if  extrapolation  from  the  small  and  often 
highly  specialized  samples  so  far  used  in  actual  tests  is  in  fact  warranted  3/)  do  suggest 
that  an  indexing  quality  generally  comparable  to  that  achievable  by  run-of-the-mill  manual 
operations,  at  comparable  costs  and  with  increased  timeliness,  can  be  achieved  by  machine. 
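
The agreement figures cited above depend on how "agreement" is computed, which the studies summarized do not define uniformly. As a minimal sketch, one simple measure (assumed here, not taken from any of the cited experiments) is the number of index terms two sources share divided by the number of distinct terms either assigned. The function name `consistency` is hypothetical.

```python
def consistency(terms_a, terms_b):
    """Ratio of shared index terms to all distinct terms assigned by either source.

    Returns 1.0 when both sources assign identical sets (including two empty
    sets) and 0.0 when they share no term.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

The same function serves for machine-to-human and human-to-human comparison, which is the point of the argument in the text: the machine result is judged against the spread observed among human indexers, not against a presumed perfect index.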

The  question  which  remains  is  simply  that  of  practicality,  today.  Extrapolation 
from small samples is highly dangerous, as is well noted even by enthusiasts for machine
techniques.    The  fact  that  for  at  least  some  systems,  the  limitations  on  number  of  clue 
words  that  can  be  handled  (due  in  part  to  computational  requirements,  matrix  manipulations, 
and  the  like)  are  such  that,  even  in  an  experimental  situation,  certain  "tests"  are  excluded 
from  the  result  statistics,  because  the  items  contained  an  insufficient  number  of  clues,  is 
a  serious  indictment  of  reasonable  extrapolations  for  these  techniques  today.    Most  tests 
so  far  reported  have  involved  not  only  a  highly  specialized  "sample"  library  or  collection, 
but  a  severe  limitation  on  the  total  number  of  "descriptors",   subject  headings,  or  classi- 
fication categories  to  be  assigned.    Maron  used  32,  Borko  21,  Williams  20,  SADSACT  70, 
Swanson  24.    How  would  any  of  these  approaches  fare,  given  several  hundred,  much  less 


1/  Borko and Bernick, 1963 [78], pp. 31-32.
2/  See Table 2.

3/  This is an important, perhaps crucial, caveat.  See, for example, Goldwyn, 1963
[233], p. 321: "In the micro-experiments of many of those who would apply statistical
techniques ...  The document collection consists of 0-100 units.  Results based
on  the  manipulation,  real  or  imagined,  of  such  a  collection  can  be  valid  for  it,  yet 
become  shaky  or  even  nonapplicable  to  larger  collections";  Perry  1958  [471],  p.  415: 
"A  degree  of  selectivity  quite  acceptable  for  files  of  moderate  size  may  prove  quite 
inadequate  in  dealing  with  large  files.    This  fact  often  makes  it  necessary  to  exert 
unusual  care  and  considerable  reserve  in  evaluating  the  results  of  small-scale  tests 
and  demonstrations  which  may  tend  to  cause  the  mass  effects  of  large  files  to  be 
underestimated or overlooked completely"; Swanson, 1962 [586], p. 288: "The
extent  to  which  semantic  characteristics  of  natural  language  are  susceptible  to  being 
generalized from small sample data is deceptive."

several thousand, possible indexing or classificatory labels?  1/

The  use  of  very  brief  short  articles,  or  of  abstracts,  as  the  members  of  experiment- 
al corpora  for  investigations  of  automatic  assignment  indexing  techniques  presuming  the 
processing of full text, either for indexing purposes or for subsequent "indexing-at-time-of-search",
is seriously misleading.  First, it is not truly representative of discursive
text, either in vocabulary-syntax, or stylistic variations involving synonymity, tropes,
elisions, dangling referents, and innumerable other meaning-implications, not explicitly
stated. 

Secondly,  as  any  author  of  a  technical  paper,  for  which  he  must  provide  an  abstract, 
knows  all  too  well,  he  must  concentrate  in  the  abstract  on  a  telegraphic  emphasis  toward 
his  principal  topic  and  the  points  he  wishes  to  make.    He  must  omit  most  qualifying,  spec- 
ifying, and  suggestive-of-other-leads-or-applications  words  and  phrases,  which  he  will  in 
fact  develop  in  the  text  itself.    For  this  reason,  even  supposing  that  the  author  himself  is 
unusually  well-aware  of  the  multiple  points  of  access  that  many  different  potential  users 
might  desire,  the  required  brevity  of  the  abstract  form  almost  necessarily  demands  terse, 
shorthand-type statements that can only increase the problems of "technese", of homography,
and of single-subject representation.

Granted, in either manual or machine-serviceable systems today, the current-awareness
scanning need is largely met by indexing based solely or primarily on title only,
or title-plus-abstract.  But is this good enough for search and retrieval?  If and only if it
is,  then  automatic  indexing  potentialities  available  today  should  be  considered  for  both 
purposes. 

Our  final  question  as  to  whether  automatic  indexing  can  be  accomplished  by  statisti- 
cal means  alone  or  must  involve  syntactic,  semantic  and  pragmatic  considerations  is  not 
entirely  answerable.    In  terms  of  achieving  comparable  quality  with  many  manually  pre- 
pared indexes  available  today,   statistical  means  alone  do  appear  promising.    But  is  the 
achievement  of  just  this  level  (even  if  accompanied  by  significant  gains  in  timeliness, 
coverage,  and  economy)  really  good  enough?   There  are  a  number  of  serious  investigators 


1/  For example, Black predicts (1963 [64], p. 19) that for most systems an adequate
vocabulary or thesaurus will comprise some twenty thousand terms.  See also
Arthur D. Little, Inc., 1963 [23], p. 65: "The enormous number of computations
required increases very rapidly with the number of indexing terms.  Existing computers,
operating serially, do not appear to be capable of handling the problem
economically for collections with 9000 or more terms even if the simplest associative
techniques are employed"; Williams, 1963 [642], p. 162: "One of the practical
problems.  .  .  is  in  the  inversion  of  large  matrices.    In  certain  methods  the  order  of  the 
matrix  will  equal  the  number  of  different  word  types  in  the  population,  which  is 
usually in the thousands."

convinced that it is not, 1/ and for this reason, research efforts are being directed toward
these  other  considerations. 

On-going  research  and  development  work   -  whether  in  modified  derivative  indexing 
approaching  a  "concept-indexing"  level;  in  automatic  assignment  indexing  techniques  as 
such;  in  automatic  classification  or  categorization  procedures,  or  in  potentially  related 
efforts  directed  toward  automatic  abstracting,  automatic  content  analysis,  and  other 
aspects  of  linguistic  data  processing  -  is  both  reasonably  extensive  and  quite  promising. 
Most of the investigators who are seriously active in the field report their current objectives
and recent accomplishments regularly to the National Science Foundation for publication
in the series "Current Research and Development Efforts in Scientific Documentation."
In the most recent issue, unfortunately current only as of November, 1962, there
are not less than 25 reports of KWIC and similar title-permuted derivative indexing
methods  generated  or  proposed-to-be-generated  by  machine,  there  are  several  instances 
of  investigations  into  various  possibilities  of  modified  derivative  indexing  to  be  accom- 
plished by  machine,  and  there  are  five  to  ten  reports  of  active  experimentation  with  various 
automatic  assignment  indexing  schemes.    These  efforts  and  even  more  recently  organized 
projects  point  in  the  hopeful  direction  that  "KWIC  indexes  should  be  merely  a  sample  of 
things  to  come".  2/ 

Assignment  indexing  techniques  so  far  investigated  can  be,  as  we  have  seen,  of  two 
types  which  are  quite  distinct  in  terms  of  the  principles  involved.    The  first,  which  can  be 
the more readily mechanized, involves the use of thesaurus-type lookup procedures covering
the definable rules of "scope notes", "authority lists", or "see also" reference practice.
The second type of assignment indexing, however, depends upon decision-making as
to  the  propriety  of  assigning  a  particular  indexing  term  to  a  particular  document  with 
reference  to  assignments  to  the  collection  as  a  whole  (or  a  sample  thereof).    This  latter 
type  of  assignment  may  be  in  terms  of  a  priori  categorizations  of  separable  subsets  of  the 
collection. 
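
The first, lookup type of assignment indexing described above can be sketched as a table lookup from words found in the text to preferred index terms. This is an illustration only: the `AUTHORITY` table, its entries, and the function name `assign_terms` are hypothetical, and a real authority list would also carry scope notes and "see also" references that the sketch ignores.

```python
import re

# Hypothetical authority list: clue word found in text -> preferred index term.
AUTHORITY = {
    "computer": "COMPUTERS",
    "computers": "COMPUTERS",
    "calculator": "COMPUTERS",
    "abstract": "ABSTRACTING",
    "abstracts": "ABSTRACTING",
    "thesaurus": "THESAURI",
}

def assign_terms(text, authority=AUTHORITY):
    """Return the sorted set of preferred terms whose clue words occur in text."""
    words = re.findall(r"[a-z]+", text.lower())
    return sorted({authority[w] for w in words if w in authority})
```

Unlike derivative indexing, the terms assigned need not occur in the document itself; several different author spellings or synonyms lead to one standardized access point. The second type of assignment, which depends on decisions made with reference to the collection as a whole, is not covered by this sketch.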

Alternatively,  the  bases  for  the  latter  type  assignment-indexing  procedures  may  be 
derived  from  a  posteriori  determinations  of  the  suitable  subsets  as  in  the  factor  analysis 
experiments of Borko, the latent class analysis approach of Baker, and the clustering-clumping
approaches to automatic classification of Needham and others.  It is to be noted
in  particular  that  Needham  thinks  an  automatically  generated  categorization  is  preferable 
precisely  because  of  lack  of  knowledge  as  to  the  exact  attributes  defining  a  class  in 


1/  See, for example, Climenson et al, 1962 [133], p. 178: "The statistical approach
attempts  to  use  no  more  than  the  occurrences  of  word  spellings  and  their  relative 
distances  in  the  document  environment  .  .  ,  [and]  cannot  provide  the  discrimination 
necessary for most indexing and abstracting applications"; Doyle, 1963 [162], p. 3:
"Automatic  indexing  and  abstracting,  as  currently  conceived,  do  not  require  any  sort 
of dictionary or other semantic reference, but only counting, comparing, and sorting -
operations well known in numerical data processing.  But success in applying such
rules  on  a  purely  automatic  basis  can't  help  but  be  limited";  Borko,   1962  [75],  p.  5: 
"Although  difficult,  identification  [of  different  meanings  carried  by  the  same  word, 
of  the  same  meaning  carried  by  different  words]  must  be  accomplished  before  the 
automatic  categorization  of  document  content  can  be  truly  effective.    For  the  most 
part  statistical  methods,  and  even  syntactic  analysis,  are  inadequate  for  the  job.  A 
technique  of  textual  analysis  based  upon  the  semantic  properties  of  language  is  need- 
ed"; Grosch,   1959  [244],  p.  20:    "We  need  semantic  methods  ...  that  will  look  for 
the  intersection  of  redundant  descriptors,  each  of  which  is  at  least  slightly  errone- 
ous. " 

2/  Doyle, 1962 [163], p. 381.

existing classification schemes.  However, in the related field of pattern recognition Uhr
and Vossler have shown promising results both for criterial feature analysis (a priori
assumption as to attributes or properties governing membership in specified classes) and
for randomly generated discrimination operators which, applied in a recursive manner,
are increasingly adaptive to the detection of class-membership (Uhr and Vossler, 1961
[615]).

One  particular  way  of  looking  at  the  problems  of  automatic  indexing  results,  in 
effect,  in  placing  these  problems  within  the  broader  field  of  pattern  perception  and  pattern 
recognition.    We  suggest  that  this  is  in  fact  a  particularly  fruitful  approach.  Certainly 
there is a wide area of potential commonality, and many promising leads for further research
in automatic categorization can be found in the general pattern recognition literature,
especially in work on randomly generated operators and on the problems of determination
of membership in classes.  1/  Conversely, automatic classification techniques
originally conceived as applicable to the handling of documentary information have in fact
been applied quite successfully to at least one case of groupings of physical objects on the
bases of machine-detectable common properties.

The  question  of  determination  of  membership-in-classes  is  basic  to  the  problems  of 
automatic  classification  and  categorization.    Thus  the  techniques  for  discriminating  the 
statistically  significant  associations  between  "properties"  of  objects  or  items  that  are  to 
be  grouped  into  classes  or  categories,  even  when  such  "properties"  are  not  known  in 
advance and have no a priori identification, point to an increasing and promising
convergence of research in pattern recognition, propaganda analysis and psycholinguistics,
mathematics and statistics, studies of linear threshold devices, and the like, as well as in the
linguistic data processing field as such.

It  is  true  that  such  synthesized  "classes"  may  have  no  convenient  "names"  or 
linguistic  interpretations  which  make  much  sense  to  the  individual  human  searcher  or  user. 
Nevertheless,  what  is  suggested  is  that  a  radical  departure  from  conventional  habits  of 
literature search and retrieval may be desirable from the standpoint of effective use of
machine  potentialities.    This  might  mean  that,  ab  initio,  the  customer  would  pose  to  the 
system  a  search  query  request  not  couched  in  his  notion  of  words  or  terms  actually  used 
in  the  system,  but  either  (a)  an  outline  or  statement  of  his  own  research  proposal  and 
plan  of  attack  or  (b)  an  indication  of  one  or  several  items  that  he  has  already  decided  are 
pertinent to his interests, with a request for "more like these".
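
A request for "more like these" can be sketched in a few lines. The following is only an illustration under stated assumptions: the report does not prescribe a similarity measure, so the cosine measure over raw word counts, and the names `term_vector`, `cosine`, and `more_like_these`, are hypothetical choices made for this example.

```python
from collections import Counter
from math import sqrt

def term_vector(text):
    """Raw word counts for one item (assumed representation)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def more_like_these(seed_items, collection):
    """Rank collection items by similarity to the pooled seed items."""
    profile = Counter()
    for item in seed_items:
        profile.update(term_vector(item))
    scored = [(cosine(profile, term_vector(doc)), doc) for doc in collection]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    # Items sharing no words with the seeds are dropped.
    return [doc for score, doc in ranked if score > 0]
```

The searcher supplies items already judged pertinent rather than index terms, so no knowledge of the system vocabulary is required of him.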

An equally radical departure from conventional present habits and thinking is already
implicit in Needham's suggestion of an automatically derived classification system and
manual assignments thereto. 2/ It would attack present-day machine capacity and
processing time limitations such that property and class or category associations must be held to
something less than 1,000 x 1,000, unless prohibitive processing costs are to be incurred.
This approach would assume a one-time large-scale building of vocabulary and term or
category associations and derivation of assignment algorithms, and the printing out of the
results in multiple copies for use by low-level clerical personnel carrying out, indeed,
"machine-like" indexing.


A final promising approach to the future prospects for fully automatic indexing and
categorization is the perseverance in research and development efforts in advance of the

1/ See, for example, Sebestyen, 1961 [539], 1962 [538].
2/ Needham, 1963 [432], p. 1.




advent  of  versatile  character  readers  and  inexpensive,  very  large  capacity,  rapid  direct 
access  memories.    These  efforts  will  include  not  only  further  systematic  exploration  of 
syntactic,  semantic  and  pragmatic  considerations  in  linguistic  data  processing,  but  also 
further attacks on the problems of language and meaning themselves. Thus, we may
conclude with Maron that: "automatic indexing represents the opening wedge in a general attack
at not only the problems of identification search and retrieval, but also the problem of
automatically transforming information on the basis of its content." 1/

If  we  are  to  attempt  to  solve  this  problem,  as  indeed  we  should,  must  we  not  look 
forward  to  the  possibilities  of  rapid  up-dating,  thesaurus  growth  and  revision,  and  quick 
and  economical  re-indexings  of  entire  collections  that  only  machine-processing  capabilities 
can  promise  today? 
APPENDIX B. PROGRESS AND PROSPECTS IN MECHANIZED INDEXING


A  working  paper  prepared  for  the  Symposium  on 
Mechanized  Abstracting  and  Indexing,  Moscow, 
28  September  -  1  October  1966 

Mary  Elizabeth  Stevens 
National  Bureau  of  Standards 
Washington,  D.  C.  20234 


The  term  mechanized  indexing  can  be  interpreted  in  two  different  ways:    as  involving 
the  use  of  machines  to  produce  indexes  once  the  index  entries  have  been  pre-determined 
manually,   or  as  involving  the  use  of  machines  to  select  the  index  entries  as  well  as  to 
prepare  the  indexes. 

The first interpretation, that of machine compilation of indexes, is perhaps best
represented by the progressively more sophisticated mechanization used for the production
of Index Medicus from manual "shingling", through sequential card camera operations, to
the computer-based system using a high-speed phototypesetter, the Photon GRACE 1, 2/.
As noted elsewhere in this report, machine capabilities have made practical the
preparation of citation indexes. In general, however, machine-compiled indexes work with the
results of human intellectual efforts as applied in the subject content analysis of documents.
We also find machines used to provide aids to the indexer. Two different tools may be
employed to improve the quality of indexing. There are prescriptive aids in the sense of
limiting and rigorously defining the scope of index terms to be used, and there are
suggestive aids in the sense of provoking ideas about additional terms that might be used.

The first type may involve a mechanized authority list or thesaurus used to normalize
proposed index term entries, as has been demonstrated by Schultz 3/ and Schultz and
Shepherd 4/ from 1960 onward. The potential value of this technique is indicated by further
investigations of Schultz et al. 5/ in which it was found that index terms proposed by authors
agreed more with terms employed by more than one member of a typical user group than
did terms available in the document titles. Another example of developments in the use of
a mechanized thesaurus is the system at Lockheed Missiles and Space Division, Palo
Alto 6/.

This type of tool is used to check proposed indexing terms against the terms of the
system vocabulary, to prescribe choices between synonyms and different levels of
specificity, and to supply syndetic devices such as "see also" references. Computer
manipulations of thesauri can also be used to diversify search questions and to provide
useful groupings of terms previously used in the system. The mechanized thesaurus can
thus serve as the second type of aid by suggesting to the human indexer additional terms he
might use. In effect, such a thesaurus provides a display of prior term-term, document-
term and document-document associations observed in a particular collection, such as was
demonstrated in the form of special purpose equipment in Taube's "EDIAC" 7/ and the
"ACORN" devices at A. D. Little 8/.
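
The prescriptive and suggestive roles of a mechanized thesaurus can be illustrated by a minimal sketch. Everything named here is hypothetical: the `PREFERRED` and `SEE_ALSO` tables, their entries, and the function `normalize_terms` are invented for illustration and do not reproduce the Schultz, Lockheed, or any other system cited.

```python
# Hypothetical authority list: each variant maps to the preferred system term.
PREFERRED = {
    "automatic indexing": "automatic indexing",
    "machine indexing": "automatic indexing",
    "kwic": "permuted title indexes",
    "permuted title indexes": "permuted title indexes",
}

# Hypothetical "see also" references attached to preferred terms.
SEE_ALSO = {
    "automatic indexing": ["automatic classification"],
}

def normalize_terms(proposed):
    """Check proposed terms against the vocabulary (prescriptive aid)
    and collect "see also" suggestions (suggestive aid)."""
    accepted, rejected = [], []
    for term in proposed:
        key = " ".join(term.lower().split())  # fold case and spacing
        preferred = PREFERRED.get(key)
        if preferred is None:
            rejected.append(term)             # not in the system vocabulary
        elif preferred not in accepted:
            accepted.append(preferred)        # synonyms collapse to one entry
    suggestions = []
    for term in accepted:
        for related in SEE_ALSO.get(term, []):
            if related not in accepted and related not in suggestions:
                suggestions.append(related)
    return accepted, rejected, suggestions
```

Rejected terms would be returned to the indexer for a decision, while the suggestions prompt him with terms he might not otherwise have considered.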

The  associational  thesaurus  can  also  be  used  to  aid  in  the  resolution  of  ambiguities  of 
natural  language  and  to  provide  for  updating  in  the  light  of  changing  terminologies  or 
changes  in  the  subject  scope  of  a  collection.    What  are  the  prospects  for  automatic  updating 
and revision of a mechanized thesaurus? Luhn 9/ has suggested that a record of the
number of times words and groups are looked up would be "an indispensable part of the system
for making periodic adjustments based on the usage of words or notions as mechanically
established."

Another suggestion for the development of mechanized aids in human indexing
procedures has been made by Markus 10/. This is to "explore the possibility of applying
programmed teaching to indexing, with or without machines."

Machine-compiled indexes rest upon the efficacy of human indexing and there is
increasing reason to doubt that this will be "good enough" for the future. It appears that
there is a growing consensus with respect to inadequacies of present scope and coverage
of indexing services. Cheydleur 11/ emphasizes that: "The cost of manual classification
and abstracting of all the articles in the world's hundred-thousand technical periodicals
would be fantastic. The practicality of carrying it out in a coordinated and timely way by
manual methods is unrealizable. There is also a pressing need to extend the coverage of a
myriad of unpublished working papers. Hence, there is an utter necessity for automatic
indexing, abstracting, and summarization by electronic data processors."


Secondly,  little  confidence  can  be  attached  to  routine,  manual  operations  to  produce 
subject-content  selection  indicia  for  subsequent  selection  and  retrieval  of  stored  documen- 
tary items  for  the  following  reasons: 

1. Wide variations of intra- and inter-analyst consistencies occur in the
assignment of content-indicia, even with respect to well-established
client-interests and index term vocabularies.

2. Potential clients may or may not be inclined to use the system, regardless of
whether or not it provides efficient content-indicator-clue and selection
criteria mechanisms.

3. Future queries cannot, in general, be effectively predicted in advance, except
for the cases of specific author or title retrieval requests.


The problem of intra-indexer and inter-indexer inconsistency is of special interest
because  the  degree  of  inconsistency  will  seriously  affect  search  and  retrieval  effectiveness 
and  because  serious  questions  are  raised  with  respect  to  the  evaluation  of  any  indexing 
system  in  terms  of  prior  or  independent  human  indexing. 

With respect to the effect of indexer inconsistency upon subsequent search
effectiveness, O'Connor 12/ considers the possibilities of overassignment (i.e., the
assignment of indexing terms to an item that a subsequent searcher would not consider pertinent
to that item) in the case where a search is specified by index terms A, B and C, each term
is over-assigned with ratio 1.0, and assignments and overassignments by the recognition
rules are statistically independent: "Then only one eighth of the papers selected by the
conjunction of A, B and C would correctly have all three terms."


The complementary disadvantage of missing relevant references on search, because
of indexer failure to supply all the appropriate indexing terms that a searcher would have
considered relevant to a particular document, would imply that, for a three-term query,
assuming independence of term-assignments and a consistency level of 50 percent, only
12.5 percent of the documents that the searcher would consider relevant would be retrieved
if someone else had indexed these items.
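
Both figures follow from the independence assumption and can be checked by a short calculation. This is a sketch under one stated assumption: an over-assignment ratio of 1.0 is read here as one incorrect assignment for every correct one, so that half the assignments of each term are correct. The function names are hypothetical.

```python
def conjunction_precision(overassignment_ratio, n_terms):
    """Fraction of items selected by an n-term conjunction that correctly
    carry all n terms, assuming independent over-assignment per term."""
    per_term_correct = 1.0 / (1.0 + overassignment_ratio)
    return per_term_correct ** n_terms

def conjunction_recall(consistency, n_terms):
    """Fraction of relevant items retrieved by an n-term conjunction when
    each term is independently supplied with the given consistency."""
    return consistency ** n_terms

print(conjunction_precision(1.0, 3))  # one eighth of the selected papers
print(conjunction_recall(0.5, 3))     # 12.5 percent of the relevant items
```

The exponent shows why conjunctive searches are so sensitive to indexing quality: each added term multiplies in a further factor of the per-term rate.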


We have previously reported 13/ on the results of 700 simulated 3-term searches
based upon both manual and machine indexing of approximately 20 items with respect to a
fixed vocabulary of less than 100 allowed descriptors. These results show that if indexer
A assigns to a given document the term "A" as indicative of subject content, then his
subsequent chances of retrieving that document with a query for term "A" are 58.4 percent if
the item had been indexed by someone other than himself, and 55.8 percent if indexed by an
automatic indexing procedure developed at NBS, called "SADSACT" (Self-Assigned
Descriptors from Self And Cited Titles) 14/. For three-term searches, any one searcher
would be able to retrieve 26.4 percent of the items he would consider relevant to his query
if they had been indexed by any of the other user-indexers, and 24.7 percent if the items
had been indexed by the machine technique.

Tinker 15/ provides evidence on the relationships between inter-indexer inconsistency
and retrieval efficiency, assuming that a given indexer is a potential querist, with average
chances of retrieval ranging from 6.5 to 36 percent. Additional evidence on the generally
unsatisfactory state of manual indexing consistency has been reported as follows:
1. Korotkin and Oliver 16/ report that five psychologists and five
non-psychologists indexed 30 items with three descriptors per item. The task
was repeated two weeks later with the aid of an alphabetized list of
"suggested" descriptors derived from the data acquired in the first session.
Mean percent consistency results were as follows:

                               Session I    Session II
Group A (Psychologists)          39.0%        53.0%
Group B (Non-psychologists)      36.4%        54.0%

2. Evaluations of relevancy of selected items to a given search request have
been explored by Badger and Goffman 17/ as follows: "Each of three
evaluators was asked to dissect the output into relevant and non-relevant
subsets ... A chi-square test was applied to the observed evaluation as
compared to those expected if the three evaluators were in complete
agreement. The chi-square test of 81.57 was very significant, indicating that there
was an absence of agreement."

3.  Greer  18/  reports  on  investigations  of  the  interpersonal  agreements  between 
subjects  asked  to  list  the  search  words  they  would  use  in  posing  queries  in 
the  field  of  information  storage  and  retrieval  systems.    He  found  "a  mean 
percentage consistency agreement of 26.1 among subjects in stating search
words."

4. Hammond 19/ provides a sampling of the use by NASA (National Aeronautics
and Space Administration) and DDC (Defense Documentation Center) of a
common set of indexing terms to index an identical set of 996 technical
reports. In considering 3-term searches against the variant indexing shown
in Hammond's tables, sample calculations show a 25-30 percent failure to
retrieve potentially relevant items.

5. In terms of intra-indexer consistency, Rodgers 20/ reports that: "A
consistency of .59 in selecting words to be indexed on two different occasions
is not sufficiently high to give us great confidence in expecting a stable store
when human indexers are used."

For these reasons, increasing consideration should be given to the second
interpretation of the term "mechanized indexing", that is, to machine generation of index entries, or
automatic indexing. This typically involves machine processing of some natural language
text, with severe problems of input. The first of several solutions involves use of
automatic character recognition techniques to convert printed text to machine-usable form.
This approach holds considerable future promise, but there are many current limitations
and difficulties.

A second possible solution, manual keyboard operations to produce a machine-useful
transcription of a text, is plagued by high costs (i.e., at least $0.01 per word for
unverified keypunching), and also by limitations of available time or manpower.

A  third  alternative  is  suggested  by  current  developments  in  computerized  typesetting 
or  tape-controlled  casting  or  photocomposition  machines.    However,  while  such  techniques 
promise  major  improvements  for  the  automatic  indexing  of  textual  information  to  be  pub- 
lished in  the  future,  little  can  be  done  for  already  available  literature,  even  with  respect  to 
the  bibliographic  citation  information  alone.    Today's  difficulties  are  emphasized  by 
estimates  of  a  cost  of  30  million  dollars  to  convert  the  present  Library  of  Congress  catalog 
to machine-readable form 21/.
Assuming, however, that the input processing problems have been solved, we may ask
what machines can do with respect to words in texts, or in portions of texts, that are
available in machine-useful form? The machines can "read" the words for purposes of shifting
and sorting and can copy or reproduce the words in some desired order, as in a
machine-prepared concordance. Machines can match input words with words already in store and
thus exclude input words from further machine consideration (as by stoplists in KWIC
(Keyword-in-Context) and other forms of derivative indexing) or stress certain input words
with reference to a selective "inclusion" dictionary.

Next, machines can tabulate and count, so that both absolute and relative word
frequency data may be applied to either indexing or search-selection algorithms.
Measurements of sequential distances between selected words in the input text may also be applied.
Machine look-ups against a master vocabulary can provide automatic supplying of syndetic
information, synonym reduction, lexical normalization, generic-specific subsumption, and
data with respect to previously observed word-word or word-subject co-occurrences. In
addition, information can be provided as to the possible syntactic roles of input words.
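
The counting and distance-measuring capabilities just listed can be sketched briefly. The function names `word_frequencies` and `min_distance` are hypothetical, and taking distance as the difference in word positions is an assumption made for this example; the report does not specify a particular measure.

```python
from collections import Counter

def word_frequencies(words):
    """Map each word to (absolute count, relative frequency)."""
    counts = Counter(words)
    total = len(words)
    return {w: (n, n / total) for w, n in counts.items()}

def min_distance(words, first, second):
    """Smallest separation, in word positions, between any occurrence of
    `first` and any occurrence of `second`; None if either is absent."""
    best = None
    last = {first: None, second: None}
    for i, w in enumerate(words):
        if w not in last:
            continue
        last[w] = i
        other = second if w == first else first
        if other != w and last[other] is not None:
            gap = i - last[other]
            if best is None or gap < best:
                best = gap
    return best
```

Relative frequencies allow texts of different lengths to be compared, while the distance measure could support rules requiring two words to occur close together.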

In  the  light  of  such  machine  capabilities,  what  can  be  said  of  the  present  state  of  the 
art in automatic indexing? Automatic indexing in the sense of machine-prepared indexes
that  are  generated  by  the  automatic  extraction  and  manipulation  of  keywords,  especially 
from  titles,  is  of  course  widely  used  in  KWIC  indexes  such  as  Chemical  Titles  and  many 
others  both  in  the  United  States  and  elsewhere. 
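
A keyword-in-context index of titles can be produced by very simple means, as the following sketch suggests. The stoplist contents, the entry layout, and the names `kwic_entries` and `format_kwic` are assumptions for illustration only and do not reproduce the procedure of Chemical Titles or of any other published KWIC index.

```python
# Hypothetical stoplist of words excluded as index entries.
STOPLIST = {"a", "an", "and", "by", "for", "in", "of", "on", "the", "to", "with"}

def kwic_entries(titles, stoplist=STOPLIST):
    """Return (keyword, left context, keyword with right context) triples,
    one per non-stoplisted title word, sorted by keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            keyword = word.strip(".,;:()\"'").lower()
            if not keyword or keyword in stoplist:
                continue
            left = " ".join(words[:i])
            right = " ".join(words[i:])
            entries.append((keyword, left, right))
    entries.sort(key=lambda entry: entry[0])
    return entries

def format_kwic(entries, width=30):
    """Align each entry so that the keywords start in a common column."""
    return [f"{left[-width:]:>{width}}  {right[:width]}"
            for _, left, right in entries]

for line in format_kwic(kwic_entries(["Automatic Indexing of Documents"])):
    print(line)
```

Each title appears once for every significant word it contains, which is the source both of the timeliness of such indexes and of their bulk.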

Fischer 22/ provides a retrospective view of KWIC indexing concepts, including
variants like KWOC (Keyword out of Context) and WADEX (Words and Authors Index to
Applied Mechanics Review). She stresses the potentialities of linking such extraction
indexing to selective dissemination systems and concludes: "Plans for using the 'Echo'
satellites to link information centers around the world, in a world wide drive toward
immediacy in information dispersion, will surely provide a place for KWIC indexes and for
the KWIC concept." Warheit 23/ also reports that consideration is being given to combining
selective dissemination systems and KWIC. Fundamental questions remain: How useful
and how much used are KWIC and other machine-generated indexes based upon the
extraction of words from a limited portion of the author's own text?

These  questions  relate  to  an  important  distinction  between  two  quite  different  types  of 
indexing.     The  distinction  is  that  whereas  "derived"  indexing  takes  as  index  entries  the 
author's  own  words  in  the  title,  the  abstract  or  the  full  text,  in  "assignment"  indexing  an 
index  term,  descriptor,  subject  heading,  or  classification  code  is  assigned  to  a  document 
as  an  indicator  of  content  and  the  term  assigned  does  not  need  to  be  identical  with  any  of 
the  author's  own  words. 

We  can  report  continuing  progress  in  use  of  derivative  indexing  techniques  such  as 
KWIC,  and  also  in  experiments  with  automatic  assignment  indexing  and  automatic  subject 
classification.     Timeliness  of  index  production  is  certainly  one  of  the  major  virtues  of 
KWIC.    A  similar  timeliness  is  promised  for  automatic  assignment  indexing  techniques 
provided  that  requirements  can  be  kept  sufficiently  low  with  respect  both  to  keystroking 
and  computer  processing. 

Intermediate results may be achieved by pre-editing, normalization, and post-editing
techniques. Manual pre-editing to modify and supplement keywords in title, abstract, or
portions of text has been used in permuted title and KWIC-type indexing from the punched
card system that began operation in 1952 24/ to the "notation-of-content" system developed
for NASA 25/. Kreithen 26/ suggests a combination of derivative and assignment indexing,
as follows:
"The combination of these two automatic indexing methods, whereby a number of
indexing terms would be assigned to a document on the basis of its category-
dependency, and the rest extracted from text, might be a desirable solution."

Automatic assignment indexing, with clue-words in the input textual material used to
determine the proper assignments of indexing terms to incoming items, is generally
equivalent to automatic classification techniques that assign a single classification category to
items, again on the basis of clue-words in the input text, because a minimum cut-off level
in the automatic assignment procedure, combined with a sufficiently generic vocabulary,
can achieve classificatory as well as indexing results. The present state of the art in
automatic assignment indexing and classification is marked by intriguing demonstrations of
technical feasibility for the relatively small samples so far investigated. Present
difficulties associated with automatic assignment indexing or classification techniques,
however, relate to problems of input processing requirements, computational limitations,
the special purpose nature of results demonstrated to date, and problems of evaluation.

A listing of automatic classification and assignment indexing experiments as of 1964
is provided in Table 2, pp. 101-103, of the text of this report. To this we should add more
recent results of our own as well as additional results reported by O'Connor 27/ and
Williams 28, 29/, Dale and Dale 30, 31/, and others.

In  the  SADSACT  method,  we  start  with  a  "teaching  sample"  of  items  representative  of 
our  collection,  to  which  indexing  terms  have  previously  been  assigned.    We  then  derive  the 
statistics  of  co-occurrences  of  substantive  words  in  the  titles  and  abstracts  of  these  items 
with  descriptors  assigned  to  them,   ending  with  a  vocabulary  of  clue  words  weighted  with 
respect  to  prior  co-occurrences  with  various  descriptors  with  which  they  have  been 
associated.

Then,  for  new  items,  we  look  up  each  word  of  input  (typically  consisting  of  100  words 
or  less:    title  and  up  to  10  cited  titles,  or  title  and  brief  abstract,  or  title  and  first  or  last 
paragraphs)  and  derive  "descriptor-selection-scores"  based  upon  the  prior  ad  hoc  word- 
descriptor  associations.    The  highest  ranking  descriptors,  in  terms  of  the  accumulated 
selection  scores,  are  then  assigned,  at  some  appropriate  cut-off  level,  to  the  new  item. 
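
The two stages just described, derivation of word-descriptor associations from a teaching sample and the scoring of new items against them, can be sketched as follows. This is not the SADSACT program: the actual weighting formula is not given in this paper, so raw co-occurrence counts are used here as an assumed stand-in, and the names `train` and `assign` are hypothetical.

```python
from collections import defaultdict

def train(teaching_sample, stoplist=frozenset()):
    """teaching_sample: iterable of (text, descriptors) pairs.
    Returns weights[word][descriptor] = number of teaching items in which
    the word co-occurred with the descriptor (assumed weighting)."""
    weights = defaultdict(lambda: defaultdict(int))
    for text, descriptors in teaching_sample:
        for word in set(text.lower().split()) - stoplist:
            for descriptor in descriptors:
                weights[word][descriptor] += 1
    return weights

def assign(text, weights, cutoff=1):
    """Accumulate descriptor-selection scores over the words of a new item
    and return the `cutoff` highest-ranking (descriptor, score) pairs."""
    scores = defaultdict(int)
    for word in text.lower().split():
        for descriptor, weight in weights.get(word, {}).items():
            scores[descriptor] += weight
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:cutoff]
```

Setting `cutoff=1` corresponds to a machine first-choice assignment; a larger cut-off yields several descriptors per item at some cost in relevance.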

To  date,  machine  first-choice  assignments  (corresponding  to  performance  figures 
reported  for  other  automatic  classification  and  indexing  experiments)  have  been  checked 
for  213  test  items  either  against  prior  DDC  indexing  or  against  user  evaluations,   or  both, 
with 72.3 percent mean overall agreement.

Our  most  recent  results  involved  150  test  items.    Machine  assignments  of  descriptors 
to  items  were  checked  by  having  up  to  five  actual  users  of  our  collection  rate  the  relevance 
to  a  given  one  of  14  descriptors  of  items  whose  titles  were  listed  under  that  descriptor  by 
the machine assignment procedure. A total of 451 pairings of user-relevance-ratings with
the machine has now been analyzed, with a mean relevance rating of 74.9 percent. With
respect  to  machine  first-choices,  there  were  206  pairings  with  85.4  percent  of  the  machine 
assignments  rated  as  at  least  somewhat  relevant. 

Checks have also been made of SADSACT results as compared to which of these same
documents would be directly retrievable if a KWIC or some other title-only index were to
be used. For the first 50 machine assignments rated as "highly relevant" in user-
evaluations, a check was made to determine whether or not the same item would be
retrievable by lookup under the name of the descriptor in a KWIC index. There were 9
such cases, or 18 percent. In 48 percent of the cases, a part of the descriptor name
occurred in the document title. For 17 cases, or 34 percent, there were no title words
identical with any part of the descriptor name.
One evaluator was also asked to review the titles of 150 test items and to indicate
which, if any, he would wish to retrieve under each of 14 descriptors. He requested in all
353 items and 209 of these were retrieved on the basis of the SADSACT assignments, for a
recall ratio of 59.2 percent. Of these, 167 had been previously evaluated by the same user
for an overall relevance ratio of 81.4 percent.
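
The recall ratio quoted here is simply the number of requested items retrieved divided by the number requested. The short sketch below checks the arithmetic and shows the same ratio computed from sets of item identifiers; the function names are hypothetical.

```python
def ratio_percent(hits, total):
    """A ratio expressed as a percentage, rounded to one decimal place."""
    return round(100.0 * hits / total, 1)

def recall_ratio(requested, retrieved):
    """Fraction of the requested items that the system actually retrieved."""
    requested = set(requested)
    return len(requested & set(retrieved)) / len(requested)

# Counts reported in the text: 209 of the 353 requested items were retrieved.
print(ratio_percent(209, 353))
```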

Summary accounts of automatic classification and assignment indexing experiments
have been provided by Schultz 32/ in the form of an "imaginary panel discussion" (in which,
hypothetically, Borko, Schultz, and Stevens discuss their respective systems), and by
Black 33/ who concludes: "Provided that overall effectiveness is nearly equal, the system
that depends less on the human element would clearly seem to be more desirable from a
standpoint of reliability and efficiency, and perhaps even from a standpoint of economics
as well."

Additional work has been reported by Dale and Dale 30, 31/, Damerau 34/, Dolby et
al. 35/, Kreithen 26/, O'Connor 27/, and Williams 28, 29/, among others. Borko's 36, 37/
more recent papers on this subject consider problems of reliability and evaluation. He
reports comparisons of automatic and manual classifications of 997 psychological abstracts
into 11 categories, factor-analytically derived from 65 percent of these abstracts used as
source items. He concluded that it was possible to determine that the percentage of
agreement between automatic classification and perfectly reliable human classification could
reach 67 percent.

O'Connor's 1965 report 12/ provides further promising results of his "machine-like
indexing by people" studies and also discussions of other techniques and of difficulties and
limitations in automatic indexing experiments to date. Using Merck, Sharp and Dohme
indexing data, O'Connor tested additional recognition-of-clue-word rules based on syntactic
emphasis, a first sentence and first paragraph measure, a syntactic-distance measure,
negations forbidden near clue words, and words naming substances or types of operations
being required in close proximity to clue words.

He  reports  considerable  success  with  these  new  rules  as  follows:    "The  computer 
rules  selected  92%  of  180  toxicity  papers.    Allowing  for  sampling  error,  these  rules  would 
select  between  88  and  95  percent  of  the  toxicity  papers.    Thus  the  computer  rules  would  be 
roughly  comparable  to,  or  perhaps  superior  to,  MSD  indexers  in  identifying  toxicity 
papers."

With  respect  to  the  difficulties  to  be  observed  in  automatic  indexing  experimentation, 
O'Connor  questions  the  adequacy  of  samplings  of  subject  specifications,  documents,  and 
collections,  the  size  of  clue  word  vocabularies,  and  the  human  judgments  used  as  stand- 
ards in  many  of  the  studies  that  have  been  made. 

The  question  of  sampling  adequacy  in  terms  of  the  representativeness  of  clue  word 
vocabularies  as  related  to  index  terms  or  classification  categories  may  be  particularly 
critical for methods using small teaching samples. Spiegel and Bennett 38/ report that:
"There seems to be no simple relation between the size of the corpus and the size of the
vocabulary but after a certain point vocabulary size increases very slowly."

Findings by Williams 28/ are encouraging. Working with teaching samples of 35, 70,
and 140 items respectively, he reports that in the first 10,000 word tokens processed from
the text of 2,700 abstracts 1,800 different word types were encountered but that in the
80,000 to 90,000 range only 255 new types appeared. He found further that "an increase in
sample size beyond 140 would not appear to offer any significant increase in classification
performance."
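
The kind of vocabulary-growth count reported by Williams can be obtained by tallying, for successive blocks of word tokens, how many word types have not been seen before. The sketch below is illustrative only; the function name and the block-wise procedure are assumptions, not a description of his program.

```python
def new_types_per_block(tokens, block_size):
    """For each successive block of `block_size` tokens, count the word
    types that did not occur in any earlier block."""
    seen = set()
    counts = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        fresh = set(block) - seen   # types first encountered in this block
        counts.append(len(fresh))
        seen |= fresh
    return counts
```

Applied to a corpus with a block size of 10,000 tokens, a sharply declining sequence of counts would indicate that the clue-word vocabulary is approaching saturation.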
Williams found an average correct classification of 62 percent for 474 test items
automatically assigned to one of four solid state categories 28/. In other tests, 2,754
solid state abstracts were classified into three primary and three secondary categories,
using a computer program capable of handling up to 50 clue words, 10 subject categories,
and any number of documents. Performance effectiveness ranged from 62 to 88 percent
correct by comparison with the original classifications at the more generic level and from
67 to 92 percent correct at the more specific level.

Further progress in the application of statistical association, clumping, and syntactic
analysis techniques has also been reported. Statistical association techniques are
concerned with correlations and coefficients of similarity assumed to exist between items
or objects sharing common properties. In documentary item applications, document-document
similarities are calculated for sharings of the same index terms or for common
patterns  of  citing  the  same  references,   of  being  cited  by  the  same  other  documents,  and 
the  like.    Word-association  techniques  include  the  development  of  absolute  or  relative  fre- 
quencies of  co-occurrence  in  a  given  set  of  documents,   such  as  those  representative  of  a 
specific  subject  matter  field.    Various  normalizing  procedures  can  be  used  to  remove 
effects of tendencies for certain words to occur frequently in general. Spiegel and
associates 38/ at Mitre Corporation have explored means of normalization to eliminate effects
of length of text strings, relative positions of words in a string, and vocabulary size.
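
A minimal sketch of a word-association measure with one possible normalization follows. The normalization used here (observed co-occurrence divided by the co-occurrence expected if the two terms were independent) is an assumption chosen for illustration; it is not claimed to be the procedure of Spiegel and associates or of any other investigator cited. Function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(docs):
    """Count, per term and per term pair, the documents in which they occur.

    `docs` is a list of term lists; repeats within a document count once.
    """
    term_df = Counter()
    pair_df = Counter()
    for doc in docs:
        terms = sorted(set(doc))
        term_df.update(terms)
        pair_df.update(combinations(terms, 2))  # pairs are alphabetically ordered
    return term_df, pair_df

def association(a, b, term_df, pair_df, n_docs):
    """Observed co-occurrence over the count expected under independence.

    Values above 1 suggest the terms co-occur more often than their
    general frequencies alone would predict.
    """
    pair = tuple(sorted((a, b)))
    expected = term_df[a] * term_df[b] / n_docs
    return pair_df[pair] / expected if expected else 0.0
```

Dividing by the expected count is what removes the effect of a word that simply occurs often in general: a very frequent word has a large expected co-occurrence with everything, so its raw co-occurrence counts are discounted accordingly.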

Ernst reports that at Arthur D. Little: "We are ... seeking to provide a working
retrieval system which will incorporate associative features. The objective will be to
make use of automatically computed index term associations as a basis for detecting and
presenting an appropriate list of near-synonyms for the concepts desired by a user --
essentially the automatic generation of a limited thesaurus in response to individual user
requests." In Switzer's model 40/, co-occurrence statistics of index terms consisting of
words from title or text, authors' names, and words and author names from cited titles
are used. Significant probabilities for such co-occurrences are then derived.

Methods  that  group  objects  or  items  in  terms  of  co-occurrence  data  for  their  prop- 
erties or  characteristics  are  involved  in  the  "clumping"  techniques  as  proposed  at  the 
Cambridge  Language  Research  Unit.     Further  investigations  into  the  development  of  the 
basic CLRU approach have been conducted at the Linguistic Research Center at the Univer-
sity of Texas, by Dale and others 30, 41/. In this work, simulation of associative doc-
ument retrieval by computer gave results for 260 computer abstracts, using the same 90
clue words as previously used by Borko: "The recall ratios in the test requests were high
(i.e., very few relevant documents were not retrieved); relevance ratios were characteris-
tically smaller (of the order of 10 percent). However, since the output lists are ordered,
it is interesting to note that the relevance ratios are significantly much higher in the upper
portions of the output lists (roughly between 25 percent and 50 percent in the upper fourth
of the output lists), and that recall ratios are still of the order of 50-70 percent."
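
The two measures quoted above can be computed for any leading portion of an ordered output list. The sketch below assumes the usual definitions (relevance ratio as the fraction of retrieved items that are relevant, recall ratio as the fraction of relevant items that are retrieved); the function name and the cutoff rule are illustrative assumptions, not taken from the work cited.

```python
def ratios_at_cutoff(ranked_ids, relevant_ids, fraction):
    """Relevance and recall ratios for the top `fraction` of a ranked list.

    ranked_ids   -- document identifiers in output order, best first
    relevant_ids -- set of identifiers judged relevant to the request
    fraction     -- portion of the list examined, e.g. 0.25 for the upper fourth
    """
    if not ranked_ids:
        return 0.0, 0.0
    cutoff = max(1, int(len(ranked_ids) * fraction))
    top = ranked_ids[:cutoff]
    hits = sum(1 for doc_id in top if doc_id in relevant_ids)
    relevance_ratio = hits / len(top)
    recall_ratio = hits / len(relevant_ids) if relevant_ids else 0.0
    return relevance_ratio, recall_ratio
```

Calling the function with `fraction=1.0` and then with `fraction=0.25` shows the effect described in the quotation: when the ordering is good, the relevance ratio rises in the upper portion of the list while the recall ratio falls only partially.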

In 1964 a report of the Astropower Laboratory 42/ outlined a "semantic space
screening model" based on the assumptions that keywords or phrases have quantifiable
'values', that by itemizing the keywords in a document sufficient information is obtained
for its classification, and that by adding the values for the keywords in a document the
pertinence of that document to a particular subject field can be determined. A training
sample consisted of 120 abstracts drawn from six subfields of electrical engineering.
Results showed successful classification of source items, using four different classifica-
tion formulas, as ranging from 49 to 96.3 percent. Results with test items ranged from
32.9 to 69.0 percent accuracy.
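
The additive assumption of this model can be sketched as follows. The table of keyword values, the field names, and the rule of choosing the field with the largest total are all hypothetical; the four classification formulas actually tested in the Astropower report are not reproduced here.

```python
def classify_by_keyword_values(keywords, value_table):
    """Sum per-field keyword values and return the best field with all totals.

    keywords    -- keywords itemized from one document
    value_table -- assumed mapping: keyword -> {subject field: value}
    """
    totals = {}
    for keyword in keywords:
        for field, value in value_table.get(keyword, {}).items():
            totals[field] = totals.get(field, 0.0) + value
    if not totals:
        return None, totals  # no known keyword, so no basis for assignment
    # Ties are broken alphabetically by field name (an arbitrary choice).
    best = max(sorted(totals), key=totals.get)
    return best, totals

# Hypothetical values for two subfields of electrical engineering.
VALUES = {
    "amplifier": {"circuits": 0.8, "antennas": 0.1},
    "gain": {"circuits": 0.4, "antennas": 0.5},
    "dipole": {"antennas": 0.9},
}
```

For example, `classify_by_keyword_values(["amplifier", "gain"], VALUES)` assigns the document to "circuits", because the summed values for that field exceed those for "antennas".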

The automatic indexing, selective dissemination and retrieval system design developed
by Ossorio 43/ is based on a system vocabulary subsequently used for the automatic assign-
ment of new items to appropriate locations in a pre-established "classification space". An
"attribute space" may also be developed to identify the kind of information found in a doc-
ument, e.g., that it deals with concepts such as weight or physical size rather than with
mathematical or space and time concepts.

Both  types  of  "space"  in  this  system  are  constructed  through  the  use  of  factor  analysis 
applied  to  previously  established  relationships  between  the  terms  in  the  system  vocabulary 
(approximately  1,450  terms)  and  49  subject  fields  and  to  relevance  ratings  of  attributes 
with respect to items. Then, "documents are indexed by being assigned a set of coor-
dinates in the classification space by means of the Classification Formula and the system
vocabulary."

With  respect  to  the  use  of  linguistic  techniques  in  automatic  indexing  and  classification, 
methods  of  computational  linguistics  may  be  used  to  derive  measures  of  the  probable 
significance of words in document texts. Damerau 34/ reports experimentation with word
subset selection for indexing purposes based upon word occurrence frequencies signif-
icantly larger than expected frequencies (following Edmundson and Wyllys, in part), with
encouraging results. Findings by Black 33/, Simmons et al 44/, Spiegel and Bennett 38/,
and  Wallace  45/,   among  others,   suggest  the  need  for  continuing  investigations  in  the  area 
of  proper  discrimination  between  significant  clue  words  and  non-informing  words  for  a 
particular  corpus  or  collection.    Extensive  computer  processing  and  analyses  such  as 
Dennis  46/  has  applied  to  the  legal  literature  are  needed  for  other  subject  matter  fields. 
The  latter  investigator  warns  that  neither  raw  word  frequencies  nor  the  numbers  of  doc- 
uments in  which  a  word  occurs  provide  good  criteria  for  distinguishing  between  trivial  or 
non-informing  and  significant  or  informing  words.    She  suggests,  instead,  that  "discrim- 
ination increases  with  the  skewness  of  the  word  distribution  in  the  file". 
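
As an illustration of the skewness criterion, the sketch below computes the moment coefficient of skewness of a word's per-document occurrence counts. The choice of this particular skewness statistic is an assumption; the source does not state which measure Dennis used. Documents are assumed to be lists of word tokens.

```python
def skewness(values):
    """Moment coefficient of skewness, m3 / m2**1.5 (0.0 if undefined)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def word_skewness(docs, word):
    """Skewness of the distribution of `word` over the documents of a file."""
    return skewness([doc.count(word) for doc in docs])
```

A word concentrated in a few documents yields a strongly skewed distribution, whereas a function word spread evenly over the file yields a value near zero, even when the two words have similar total frequencies.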

Baxendale  has  suggested  that  certain  types  of  phrase  structures  and  nominal  construc- 
tions, as  determined  by  relatively  unsophisticated  machine  syntactic  analyses,   are  useful 
in  revealing  appropriate  subject-content  clues.    A  recent  example  is  provided  by  Clarke 
and Wall 47/: "The hypothesis is that the importance of nominal constructions in selection
of index unit candidates places emphasis on the bracketing of all noun phrases."
Baxendale's continuing work 48/ further suggests that "through the methods of statistical
decision theory it is hoped to formulate quantitative measures that will separate inform-
ative index terms from noninformative." Continuing use of syntactic analysis principles is
provided as an option in the SMART system (Salton 49/) and possibilities for choosing index
terms  automatically  by  syntactic  criteria  have  been  explored  by  Dolby  et  al  35/. 

Closely  related  to  automatic  classification  or  indexing  experiments  involving  linguistic 
factors  are  document  and  word  grouping  investigations  for  homograph  resolution  and  sub- 
ject field  identification  purposes,   such  as  those  of  Doyle  and  Wallace  45/.     Doyle  used 
a  Fortran  computer  program  developed  by  Ward  and  Hook  for  iterative  automatic  groupings 
of  50  physics  and  50  non-physics  documents.    He  was  able  to  show  clear-cut  separation  of 
two  meanings  of  words  such  as  "force"  and  "satellite". 

A  case  involving  overlaps  of  word  memberships  in  more  than  one  subject  class  has 
been  investigated  by  Wallace  45/.     Using  word  frequency  data,  he  found  48  words  in  com- 
mon on  the  first  100  word-frequency  rankings  for  psychological  and  computer  literature 
abstracts,  with  function  words  predominating.    However,  using  a  word  rank  sum  criterion, 
he  was  able  to  separate  50  psychological  abstracts  from  50  computer  abstracts  with  78 
percent  success. 
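
The source does not spell out Wallace's word rank sum criterion, so the following is only one plausible reading, offered as a sketch under stated assumptions: each field's words are ranked by frequency in a set of sample abstracts, an abstract's words are looked up in each ranking, and the abstract is assigned to the field giving the smaller sum of ranks. The penalty rank for unlisted words and all function names are hypothetical.

```python
from collections import Counter

def rank_table(abstracts):
    """Map each word to its frequency rank (1 = most frequent) in a field."""
    counts = Counter(word for abstract in abstracts for word in abstract)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties: alphabetical
    return {word: rank for rank, word in enumerate(ordered, start=1)}

def rank_sum(words, table):
    """Sum of ranks; words absent from the table get the rank len(table) + 1."""
    missing = len(table) + 1
    return sum(table.get(word, missing) for word in words)

def assign_field(words, tables):
    """Choose the field whose ranking gives the smallest rank sum."""
    return min(sorted(tables), key=lambda field: rank_sum(words, tables[field]))
```

Function words shared by both rankings contribute about equally to each sum and so cancel out, which is consistent with the observation that separation was possible despite the 48 words the two literatures had in common.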

We may thus conclude that the progress and prospects of automatic indexing, as of
September 1966, are both provocative and challenging. They are "provocative" because so
much  in  terms  of  both  practical  and  theoretical  accomplishment  has  already  been  dem- 
onstrated, and  "challenging"  because  so  much  remains  to  be  done.    Further,  what  remains 
to  be  done  will  in  all  probability  require  serious,  intensive,  and  imaginative  investigations 
of  a  wide  variety  of  questions  from  the  relative  usage  and  acceptability  of  a  KWIC  index 
through  possible  changes  in  author  and  editor  practices  to  the  fundamental  questions  of 
semantics  and  human  judgment. 

Nevertheless,  when  the  results  of  automatic  classification  or  automatic  indexing 
procedures  reach  levels  of  70  percent  or  better  mean  agreement  either  with  human  in- 
dexers  or  with  potential  users  evaluating  the  relevance  of  items  retrieved  by  such  indexing, 
then  the  machine  methods  should  be  preferred  to  routine,  run-of-the-mill,  manual  indexing 
wherever  the  costs  are  at  least  commensurate.  / 

The  technical  feasibility  of  achieving  such  performance  levels  for  a  relatively  small 
number  of  classification  categories  or  a  relatively  small  vocabulary  of  index  terms  has 
already  been  demonstrated  experimentally.    There  remain  unresolved  questions  of  the 
extent to which it will be possible to apply such techniques to the larger vocabulary require-
ments and the practical operating considerations in actual collections.

Assuming that we can solve these problems, however, many advantages will accrue.
First is the speed with which many items can be indexed, in a few minutes or hours at
most for, say, 10,000 items. Secondly, there are advantages of timeliness and the ease
with  which  an  entire  collection  can  be  re-indexed  or  re-classified.    A  third  advantage  is 
the  consistency  of  the  machine  procedures,   especially  as  compared  with  the  inconsistency 
to  be  noted  in  available  data  on  tests  of  comparative  performance  among  indexers. 

The  advantage  of  ability  to  re-index  quickly,   easily,  and  inexpensively  (because  most 
input  costs  will  have  been  incurred  previously)  is  of  major  importance  in  terms  of  over- 
coming present  barriers  to  the  introduction  of  improvements  in  operating  systems  (since, 
as Kyle 51/ points out, "The most common reason for not trying new and/or improved
techniques of classification and indexing is the difficulty of reclassifying and re-indexing
large collections") and in terms of dynamic revision and up-dating (as Borko 37/
emphasizes).

Another advantage, particularly of methods using teaching samples, is (as suggested by
Mooers as early as 1959 52/) the capability for making assignments of indexing terms in,
say, an English language system to items whose texts are written in other languages:
French,   German,  or  Russian.     This  type  of  advantage  can  point  the  way  to  greater  interna- 
tional collaboration  in  indexing  and  document  control  procedures. 

A  further  possibility  is  suggested  by  the  convergence  of  automatic  indexing  techniques 
based upon teaching samples with adaptive selective dissemination systems and client feed-
back possibilities, especially those involving "more-like-this!" requests. If we assume a
large-scale, multiple-access system with adequate personalized files for the typical client,
the  common  data  bank  of  document  identificatory  and  selection  criteria,  condensed  rep- 
resentations, and  full  text  (if  available)  can  be  selectively  accessed  by  him  on  the  basis  of 
automatic  indexing  generated  by  his  own  choice  of  selection  criteria  and  his  own  choice  of 
exemplar  items  for  each  such  criterion. 

He may provide a standing-order interest profile with respect to patterns of his own
selection criteria, with weighting indications as to relative degrees of interest. Dynamic
re-adjustments to standing requests and weightings can be made in accordance both with
his responses to notifications and with any "more-like-this" requests received from him.
System  accounting  and  usage  statistics  can  provide  a  feedback  warning  system  as  to  the 
adequacy  of  his  selection-criteria  set  and  enable  him  to  initiate  re-processing  of  those 
documents  in  the  collection  likely  to  be  of  current  interest  to  him. 
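
A weighted interest profile with feedback adjustment might be sketched as below. The scoring rule (sum of the weights of matching criteria) and the update rule (a fixed increment or decrement per response) are assumptions introduced for illustration only; no particular dissemination system is being described, and the names are hypothetical.

```python
def score(profile, doc_terms):
    """Sum the client's weights for the selection criteria a document matches."""
    return sum(weight for term, weight in profile.items() if term in doc_terms)

def adjust(profile, doc_terms, liked, rate=0.5):
    """Return a new profile re-weighted after one client response.

    liked=True models a "more-like-this" request: every term of the
    exemplar item gains `rate`, and unseen terms are added to the profile.
    liked=False models a rejected notification: matching terms lose
    `rate`, but never fall below zero.
    """
    updated = dict(profile)
    for term in doc_terms:
        if liked:
            updated[term] = updated.get(term, 0.0) + rate
        elif term in updated:
            updated[term] = max(0.0, updated[term] - rate)
    return updated
```

Returning a new profile rather than altering the old one keeps the earlier standing order available, which would allow usage statistics to compare the client's selection criteria before and after each adjustment.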

We  must  close,  however,  with  a  caveat:    if  machines  have  not  yet  mastered  us,  neither 
have  we  yet  the  requirements  of  the  machine  to  the  degree  of  advanced  planning  that  will  be 
required,  especially  for  those  information  processing  operations  involving  the  analysis  of 
content  and  not  merely  the  manipulation  of  records:    for  here  we  are  faced  with  the  great 
challenges of human communication, human decision-making, and human problem-solving.